Thesis excerpt-infobright

Last Update:2014-08-14 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Background

Brighthouse: ananalytic data warehouse for ad-hoc queries, vldb 2008

Brighthouse is a column-oriented data warehouse. In terms of column storage and data compression, the data compression ratio reaches 10:1. Its core knowledge grid (knowledge grid) layer is a metadata layer capable of automatic adjustment and storage, replacing the indexing function, it provides data filtering, statistical information expression, and actual data location information, so that brighthouse can be used as an analytical data warehouse to achieve ad-hoc query speed. The knodge DGE grid layer is involved in the query optimization and execution phase to reduce the data read volume and decompression overhead.

The brighthouse in this paper is the commercial data warehouse infobright.

Introduction

The column-Oriented Architecture is more suitable for analytical data warehouses and the row-Oriented Architecture is more suitable for OLTP systems. Brighthouse is column-oriented.

Knowledge gird is a data about data concept. It provides statistical information similar to the data to help obtain the desired data during query optimization and execution. This is the core part of the brighthouse design. At a level, it is between the query optimization, execution layer, and data (compression) storage layer. In addition, the metadata stored in the knowledge grid is very small and can be stored in the memory.

The knowledge grid is composed of knodge DGE nodes. Each node records the metadata information of the compressed data. The actual data is stored in data packs, and the data volume is large, which is saved by column, partition is not performed (this part of information is maintained by the knowledge node) and is compressed. Therefore, data packs represents the data storage model of brighthouse, while knodge DGE grid is similar to the metadata layer.

Architecture and module

Architecture diagram:

The gray part is the original module of MySQL, and the white and blue parts are the ones of infobright.

Like MySQL, the preceding logic layer processes query logic, and the following is the storage engine.

This figure mainly describes the data import and export, datapack, knowledge grid, optimization, and execution.

The loader and unloader on the right side of the logic layer are the data import and export modules of infobright and are independent services.

The bottom layer of the storage layer is data pack. Each pack contains 64 K Elements of a column, and all data is packaged and stored in this form. datapack adopts different compression algorithms based on different data types, with a high compression ratio.

The knodge DGE grid contains two types of nodes:

Each data pack node corresponds to a data pack, storing statistics such as the number of Min, Max, AVG, null, and total number;

Knowledge node stores more advanced statistics and connection information with other tables. Some of the information here is well calculated during data loading and some are computed with queries, so it is automated.

The knowledge grid also contains such data information.

1. histograms (hists): a bar chart created for numeric columns. It is saved in binary format.

2. Character maps (cmaps): information created for letter columns, such as the occurrence of each letter in string.

3. Pack-to-packs. This part is prepared for join operations and is associated with two column values under a condition of two tables.

Application,

Hists is suitable for between statements, because the bar chart expresses the largest and smallest information, and the information in the range is classified.

Cmaps is suitable for like statements because it is related to letters.

Pack-to-packs is suitable for join operations and provides the row numbers of tables suitable for join operations.

The above roughly describes several pieces of data statistics and the applicable scenarios.

In terms of query optimization and execution, we refer to the rough set idea to design and divide the data into three types: correlation, no correlation, and suspicion. The three types correspond to positive region, negative region, and boundary region.

Summary

As an open-source Mysql Data Warehouse solution, infobright introduces the column storage solution, high-strength data compression, and optimized statistical computing, this article excerpted the most important design point in the infobright paper. knowledgegird is the core of infobright design.

Full Text :)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More