Background
Brighthouse: ananalytic data warehouse for ad-hoc queries, vldb 2008
Brighthouse is a column-oriented data warehouse. In terms of column storage and data compression, the data compression ratio reaches 10:1. Its core knowledge grid (knowledge grid) layer is a metadata layer capable of automatic adjustment and storage, replacing the indexing function, it provides data filtering, statistical information expression, and actual data location information, so that brighthouse can be used as an analytical data warehouse to achieve ad-hoc query speed. The knodge DGE grid layer is involved in the query optimization and execution phase to reduce the data read volume and decompression overhead.
The brighthouse in this paper is the commercial data warehouse infobright.
Introduction
The column-Oriented Architecture is more suitable for analytical data warehouses and the row-Oriented Architecture is more suitable for OLTP systems. Brighthouse is column-oriented.
Knowledge gird is a data about data concept. It provides statistical information similar to the data to help obtain the desired data during query optimization and execution. This is the core part of the brighthouse design. At a level, it is between the query optimization, execution layer, and data (compression) storage layer. In addition, the metadata stored in the knowledge grid is very small and can be stored in the memory.
The knowledge grid is composed of knodge DGE nodes. Each node records the metadata information of the compressed data. The actual data is stored in data packs, and the data volume is large, which is saved by column, partition is not performed (this part of information is maintained by the knowledge node) and is compressed. Therefore, data packs represents the data storage model of brighthouse, while knodge DGE grid is similar to the metadata layer.
Architecture and module
Architecture diagram:
The gray part is the original module of MySQL, and the white and blue parts are the ones of infobright.
Like MySQL, the preceding logic layer processes query logic, and the following is the storage engine.
This figure mainly describes the data import and export, datapack, knowledge grid, optimization, and execution.
The loader and unloader on the right side of the logic layer are the data import and export modules of infobright and are independent services.
The bottom layer of the storage layer is data pack. Each pack contains 64 K Elements of a column, and all data is packaged and stored in this form. datapack adopts different compression algorithms based on different data types, with a high compression ratio.
The knodge DGE grid contains two types of nodes:
Each data pack node corresponds to a data pack, storing statistics such as the number of Min, Max, AVG, null, and total number;
Knowledge node stores more advanced statistics and connection information with other tables. Some of the information here is well calculated during data loading and some are computed with queries, so it is automated.
The knowledge grid also contains such data information.
1. histograms (hists): a bar chart created for numeric columns. It is saved in binary format.
2. Character maps (cmaps): information created for letter columns, such as the occurrence of each letter in string.
3. Pack-to-packs. This part is prepared for join operations and is associated with two column values under a condition of two tables.
Application,
Hists is suitable for between statements, because the bar chart expresses the largest and smallest information, and the information in the range is classified.
Cmaps is suitable for like statements because it is related to letters.
Pack-to-packs is suitable for join operations and provides the row numbers of tables suitable for join operations.
The above roughly describes several pieces of data statistics and the applicable scenarios.
In terms of query optimization and execution, we refer to the rough set idea to design and divide the data into three types: correlation, no correlation, and suspicion. The three types correspond to positive region, negative region, and boundary region.
Summary
As an open-source Mysql Data Warehouse solution, infobright introduces the column storage solution, high-strength data compression, and optimized statistical computing, this article excerpted the most important design point in the infobright paper. knowledgegird is the core of infobright design.
Full Text :)