Infobright: An Efficient Open-Source MySQL Data Warehouse Solution in Detail


Infobright is a column-oriented database built on unique, patented Knowledge Grid technology. It is an open-source MySQL data warehouse solution that introduces column storage, high-ratio data compression, and optimized statistical computation (such as SUM/AVG/GROUP BY). Infobright is based on MySQL, but it does not depend on an existing MySQL installation because it ships with its own. MySQL can be roughly divided into a logic layer and physical storage engines. Infobright mainly implements a storage engine, but because its storage logic differs fundamentally from that of a relational database, it cannot simply be attached to MySQL as a plug-in the way InnoDB can: its logic layer is MySQL's logic layer plus Infobright's own optimizer.

Infobright Features

Advantages:

  1. Strong, stable query performance on large data volumes: the same SELECT query over millions, tens of millions, or hundreds of millions of records runs up to roughly 60 times faster than on ordinary MySQL storage engines such as MyISAM and InnoDB. The efficiency comes mainly from a storage structure specially designed to optimize queries, so the actual gain depends on the schema and the query design.
  2. Large data volumes: terabytes of data, billions of records. This capacity rests on Infobright's own high-speed data loading tool (gigabytes per hour) and its high data compression ratio (better than 10:1).
  3. High data compression ratio: the average compression ratio is said to reach 10:1 and can even reach 40:1, greatly reducing storage space. The high ratio rests mainly on column-based storage and patent-pending adaptive compression algorithms.
  4. Column-based storage: no indexes and no partitions are required, and queries stay fast even on huge data volumes. It is designed for processing massive data in data warehouse scenarios. With no indexes to create, you avoid both the work of maintaining them and the bloat of indexes growing with the data. Each column's data is compressed and stored in blocks, and a Knowledge Grid node records statistics for each block; these statistics, rather than indexes, accelerate searches.
  5. Quick response to complex aggregate queries: well suited to complex analytical SQL such as SUM, COUNT, AVG, and GROUP BY (see the sketch after this list).
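
For illustration, a minimal sketch of the kind of aggregate query Infobright targets (hypothetical table and column names):

SELECT region, COUNT(*) AS orders, AVG(amount) AS avg_amount
FROM sales
WHERE sale_date >= '2012-01-01'
GROUP BY region;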

Infobright Value

  1. Saves design cost: no complex data warehouse modeling is required (such as star or snowflake schemas), and no materialized views, data partitioning, or index creation are needed
  2. Saves storage resources: the compression ratio is typically around 10:1, and some applications may reach 40:1
  3. Broad integration: compatible with many BI suites, such as Pentaho, Cognos, and Jaspersoft
  4. Lower O&M costs: query and load performance stay stable as the database grows, and implementation and administration are simple, requiring minimal management
  5. Commercial backing: the first commercially supported open-source analytic data warehouse database, and an officially recommended Oracle/MySQL data warehouse integration architecture

Infobright Applicable Scenarios

  1. Big data analytics: web/online analytics, mobile analytics, customer behavior analysis, marketing and advertising analytics
  2. Log/event management systems: telecom call detail record analysis and reporting, system/network security audit records
  3. Data marts: purpose-specific data warehouses for enterprises and institutions, and data warehousing for small and medium-sized businesses
  4. Embedded analytics: embedded analytic applications for independent software vendors and SaaS providers

Restrictions:

  1. No data updates: the Community Edition (ICE) can import data only via LOAD DATA INFILE (see the sketch after this list); INSERT, UPDATE, and DELETE are not supported. This makes modifying data very difficult and rules it out as a data warehouse serving real-time data.
  2. No high concurrency: only around ten concurrent queries are supported. While that is sufficient for typical analytical use, low machine utilization is always bad news for the owner, especially when concurrent requests are plentiful.
  3. No master-slave replication or horizontal scaling: without built-in replication, data can be backed up by loading it onto two machines in parallel, but consistency can only be verified after the fact, and the slave is out of service for long stretches during loads. As for horizontal scaling, it is simply not a distributed storage system.
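
For reference, a minimal sketch of the only load path ICE supports (hypothetical file and table names; @bh_dataformat is the Infobright session variable listed under the import tools section below):

set @bh_dataformat = 'txt_variable';   -- Infobright loader, CSV input
LOAD DATA INFILE '/tmp/sales.csv' INTO TABLE sales
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';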

Comparison with MySQL

  1. Infobright suits data warehouse scenarios: non-transactional, non-real-time, low-concurrency, analysis-oriented workloads over settled facts such as logs or large volumes of aggregated data. It is therefore not suited to serving website user requests; in fact, fetching a single record takes far longer than in MySQL, but scanning and aggregating masses of records is far faster.
  2. MySQL data files usually occupy more space than the raw data because of indexes, whereas Infobright's compression is very powerful and compresses each column according to its type.
  3. The service interface is the same as MySQL's: you start the Infobright server much as you would MySQL, and applications that already connect to MySQL can connect to and query Infobright the same way. This is good news for seasoned MySQL users, for whom the learning cost is essentially zero.

Infobright has two editions: the open-source ICE and the closed-source commercial IEE. ICE provides adequate functionality but supports only LOAD DATA INFILE; INSERT, DELETE, and UPDATE are unavailable. IEE offers fuller functionality and is said to be faster at queries.

The Community (ICE) edition has been evaluated by major Chinese enterprises but has seen little adoption in production systems, mainly for the following reasons:

  1. Restrictions on DML and ALTER statements
  2. Incremental loads require periodic export and import
  3. The built-in MyISAM engine copes poorly with high concurrency; to use server resources fully, you must run a second MySQL instance
  4. Poor support for Chinese and other multi-byte character sets
  5. Only single-core scheduling per query
  6. No vendor support

Differences between ICE and IEE

IEE includes additional features tailored to the needs of most enterprises, such as better query performance, DML statement support, and distributed import. The IEE edition also comes with a level of support from Infobright or its agents: emergency assistance, product training, and so on.

  1. Significant query performance difference. Although both IEE and ICE query far faster than row-oriented databases such as Oracle, SQL Server, and MySQL, IEE is 50-500% faster than ICE. The gap comes mainly from the multi-threaded scheduling module unique to the IEE core engine (introduced in IEE 3.5). In ICE a single query can use only one CPU core, leaving the other cores to other query processes; for complex queries that must filter and sift large amounts of data, IEE's multi-threaded scheduler saves significant time.
  2. DML support. IEE supports standard SQL data manipulation, using INSERT, UPDATE, and DELETE to change data. ICE supports only LOAD DATA INFILE for import, and any data change means re-importing everything. Note that using DML degrades query performance, increasingly so with repeated use.
  3. DDL support, including ALTER TABLE RENAME, ADD COLUMN, and DROP COLUMN (though column operations take effect only on the last column); see the sketch after this list.
  4. Hadoop interface support (through DLP)
  5. Advanced replication and high availability. IEE includes master/slave replication based on SQL statements.
  6. Easier, faster imports. IEE supports DLP, the distributed import tool, and also bundles the standard MySQL native loader for some complex imports, which incidentally suggests that the Infobright loader's fault tolerance is weak.
  7. Consistent queries concurrent with LOAD or DML operations
  8. Temporary table support
  9. Others: commercial license, after-sales support, and so on
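
To make item 3 concrete, a hedged sketch of the DDL forms described above (hypothetical table and column names):

ALTER TABLE events RENAME TO events_2012;
ALTER TABLE events_2012 ADD COLUMN note CHAR(32);   -- new columns are appended last
ALTER TABLE events_2012 DROP COLUMN note;           -- only the last column can be dropped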

Architecture

MySQL-based internal architecture: Infobright adopts an internal architecture similar to MySQL's. In Infobright's architecture diagram (not reproduced here), the gray parts are original MySQL modules, while the white and blue parts belong to Infobright.

Like MySQL, Infobright has a two-layer structure:

  • Logic layer: processes query logic (plus service and application management). The loader and unloader on the right side of the logic layer are Infobright's data import and export modules, i.e. LOAD DATA INFILE ... and SELECT ... INTO FILE. Because they target massive-data environments, import and export run as an independent service rather than going through the MySQL modules directly. The logic layer also wraps Infobright's own optimizer around the MySQL query optimizer; as discussed below, the special structures in its storage layer make its approach to query optimization quite different from MySQL's.
  • Storage engine: Infobright's default storage engine is Brighthouse, but it also supports others such as MyISAM, MRG_MyISAM, Memory, and CSV. Infobright organizes data in three layers: DP (Data Pack), DPN (Data Pack Node), and KN (Knowledge Node). Above these sits the very powerful Knowledge Grid.

Infobright Modules

  1. Optimizer: minimizes the amount of data that must be decompressed, effectively improving the execution plan.
  2. Knowledge Grid: stores metadata, column information, table relationships, statistics on how data is distributed across blocks, and cached information from previous queries.
  3. Data Packs (data blocks): where the actual data lives, compressed and stored block by block.

Data Pack (Data Block) Compression Layer

At the bottom of the storage engine sit the Data Packs. Each pack holds 64K (65,536) values from a single column, and all data is stored in this packed form. Each block is compressed according to its type (different compression algorithms for different data types), giving a high compression ratio. The compressor and decompressor in the layer above take care of this.

Infobright claims a data compression ratio of 10:1 to 40:1. As mentioned earlier, compression is applied per data type within each DP; the system automatically selects the compression algorithm and tunes its parameters for the best ratio. Consider the compression ratios observed in the author's test environment (figure omitted):

The overall compression ratio there was 20.302. A common misunderstanding is worth clearing up: this ratio is the size of the raw data as stored in the database divided by the compressed size, not the size of the source text file divided by the compressed size. The former is obviously much larger than the latter; in my test environment the latter was considerably smaller. Text data stored in a database is generally much larger than the original text, because some fields are fixed-length and occupy more space than their contents, and the database also keeps plenty of statistical data, including indexes, which consume substantial space. Infobright has no indexes, but it does have KN data, which typically accounts for about 1% of the total data size.

Since Infobright compresses by data type, let's look at how the compression ratio varies across types; the test results (table not reproduced here) are discussed below.

Look first at the integer types: the compression ratios order as int < mediumint < smallint. A careful reader will wonder how tinyint's compression ratio can be smaller than int's. Besides the data type, the compression ratio also depends markedly on how varied the data is. The PosFlag column has only three possible values (0, 1, and -1), and such data evidently cannot achieve a good compression ratio.

Now consider the act field, which uses comment lookup; it achieves a better compression ratio and query performance than a plain char column. Comment lookup works much like a bitmap index; its usage is covered in detail in a later section. Among all the fields, the date column compresses best, ending up at only 0.1 MB. Varchar compresses comparatively poorly, so avoid varchar unless it is really needed.

These numbers clearly show Infobright's compression power. To stress the point again: compression depends not only on the data type but also, significantly, on how varied the data is. When choosing field types, I personally put performance first. Some fields in the table above could be optimized further: the ip could be stored as bigint, and the date could even be split into year, month, and day columns as needed.
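
For instance, a hedged sketch of the IP optimization (hypothetical table; MySQL's INET_ATON()/INET_NTOA() functions convert between dotted-quad strings and integers):

CREATE TABLE access_log (
  ip       BIGINT,     -- INET_ATON('192.168.1.10') = 3232235786
  hit_date DATE
);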

Knowledge Grid

Above the compression layer sits Infobright's most important concept: the Knowledge Grid. It is the foundation that lets Infobright dispense with indexes yet still serve queries over massive data, and its architecture is a key reason for Infobright's high performance. It contains two types of nodes:

  1. Data Pack Node (DPN): each DPN corresponds to one Data Pack and records statistics about the data stored (compressed) in that DP, including the maximum, minimum, number of nulls, total count of values, sum, average, number of distinct values, and so on.
  2. Knowledge Node (KN): stores more advanced statistics and metadata about DPs or relationships between columns, such as value ranges (Min_Max) and associations between column data and other tables. Most KN data is generated when data is loaded; the rest is computed during queries, which is why it is described as "intelligent".

The Knowledge Grid can be divided into four parts: DPN, Histogram, CMAP, and Pack-To-Pack (P-2-P).

DPN has already been described above.

Histograms improve query performance for numeric types (such as date, time, and decimal) and are generated when data is loaded. A Histogram records the min and max of a DP and splits the Min-Max range into 1024 intervals; if the range covers fewer than 1024 distinct values, each interval holds a single value. The KN then marks, in binary, whether any value of the DP falls into each interval.

A Histogram quickly determines whether the current DP can satisfy the query conditions. For example, for SELECT id FROM customerInfo WHERE id > 50 AND id < 70, the Histogram may show at once that the current DP cannot match. Histograms thus substantially reduce the number of DPs examined for numeric queries.

CMAP serves text queries and is likewise generated when data is loaded. It records the occurrence of ASCII characters at positions 1-64 of the text values in the current DP.

For example, a CMAP might show that 'A' never appears at the second, third, or fourth position of any text value in the DP (0 meaning absent, 1 meaning present). Since text comparisons proceed byte by byte, CMAP can improve the performance of text queries.

Pack-To-Pack is generated during JOIN operations. It is a bitmap, i.e. a binary matrix, describing the relationship between the DPs of the two joined columns.

  • Stored in memory, scoped to a single session
  • Improves JOIN query performance, whether newly built or reused

Rough sets are one of Infobright's core technologies. During query execution, Infobright uses the Knowledge Grid to divide DPs into three classes:

  1. Relevant Packs: every element satisfies the query conditions
  2. Irrelevant Packs: no element satisfies the query conditions
  3. Suspect Packs: some elements may satisfy the query conditions, so the DP must be decompressed and examined

Case:

Consider the following query:

SELECT count(*) FROM employees
WHERE salary > 100000 AND age < 35 AND job = 'IT' AND city = 'San Mateo';

  1. Find the DPs that can contain salary > 100000
  2. Find the DPs that can contain age < 35
  3. Find the DPs that can contain job = 'IT'
  4. Find the DPs that can contain city = 'San Mateo'
  5. Eliminate every DP irrelevant to the search criteria
  6. Decompress only the data in the remaining relevant and suspect DPs
  7. Execute the search

From the above analysis, we can see that Infobright executes certain queries very efficiently, and the more selective the WHERE conditions, the better. Highly selective conditions let Infobright decide more precisely whether a DP is relevant, irrelevant, or suspect, minimizing the number of DPs to read and the performance cost of decompression. The Histogram and CMAP described in the previous section serve exactly this kind of condition testing and can effectively improve query performance.

Multi-table joins follow a similar principle. Pack-To-Pack first records the relationships between the DPs of the joined columns. For example, for SELECT MAX(X.D) FROM T JOIN X ON T.B = X.C WHERE T.A > 6, Pack-To-Pack builds a relation matrix M between the DPs of T.B and the DPs of X.C: if the first DP of T.B and the first DP of X.C share any elements, then M[1,1] = 1, otherwise M[1,1] = 0. This effectively reduces the number of DPs touched during the join.

Having covered compression earlier, one last note on DP compression: the 64K elements in each DP are treated as a sequence, the positions of all nulls are stored separately, and the remaining non-null data is compressed. Compression depends on the data type; Infobright selects an algorithm per type and automatically tunes its parameters for the optimal ratio.

The Knowledge Grid is more intricate than this summary suggests; for details, see the official white paper and the paper "Brighthouse: An Analytic Data Warehouse for Ad-hoc Queries".

Query Optimization

We have analyzed Infobright's architecture and briefly introduced its compression process and working principles. Now let's turn to query optimization.

1) Configure the environment: on Linux, Infobright can be configured by editing the brighthouse.ini file according to the instructions in the README.

2) Select an efficient data type

Infobright supports all of MySQL's original data types, among which integer types are the most efficient. Use the following data types whenever possible:

  • TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT
  • DECIMAL (keep the number of decimal digits as small as possible)
  • DATE, TIME

Inefficient data types that are not recommended:

  • BINARY, VARBINARY
  • FLOAT
  • DOUBLE
  • VARCHAR
  • TINYTEXT, TEXT

Some experiences and notes about using Infobright data types:

  1. Infobright's numeric ranges differ slightly from MySQL's. For example, Infobright's minimum int is -2147483647, while MySQL's is -2147483648. Other numeric types have similar quirks.
  2. Use the smallest type that fits: if SMALLINT suffices, do not use INT. On this point Infobright matches MySQL practice.
  3. Avoid inefficient types such as TEXT. Prefer DECIMAL to FLOAT where possible, keeping the precision trade-off in mind.
  4. Use VARCHAR as little as possible; dynamic VARCHAR performs poorly in MySQL, so avoid it where you can. Where it fits, convert VARCHAR columns to CHAR or to an integer type. VARCHAR's advantage is variable-length allocation, but given Infobright's excellent compression, converting VARCHAR to CHAR is usually worthwhile: CHAR gives better query and compression performance.
  5. Use INT whenever possible; much CHAR data can be converted to integers. For example, the permanent customer id and customer id in a search log can be stored as BIGINT rather than CHAR. Splitting a timestamp into year, month, and day columns is often a good choice too (see the sketch after this list); in the systems I have seen, time is the most frequently used field, and improving the query performance of the time field clearly matters. Of course, this depends on the specifics of the system: if you split the time, you give up MySQL's time functions during analysis.
  6. The varchar and char fields can also use comment lookup, which can significantly improve the compression ratio and query performance.
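
As an illustration of the time-splitting idea, a minimal sketch (hypothetical table; the three columns replace a single DATETIME):

CREATE TABLE search_log (
  user_id BIGINT,      -- customer id stored as an integer rather than CHAR
  yr      SMALLINT,    -- from YEAR(event_time)
  mon     TINYINT,     -- from MONTH(event_time)
  day     TINYINT      -- from DAYOFMONTH(event_time)
);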

3) use comment lookup

Comment lookup can be used explicitly only on CHAR and VARCHAR columns. It reduces storage space, improves the compression ratio, and speeds up queries on those fields. Its mechanism resembles a bitmap index: each string is replaced internally by a short integer value, which yields better query performance and compression. Comment Lookup makes demands on the data as well as the type: as a rule of thumb, the column should have fewer than 10,000 distinct values, and it should average more than 10 rows per distinct value. Typical candidates are fields like age, gender, and province.

Using comment lookup is simple; declare it in the column definitions when creating the table, for example (hypothetical table name):

CREATE TABLE example (
  act  CHAR(15) COMMENT 'lookup',
  part CHAR(4)  COMMENT 'lookup'
);

4) Import data in order whenever possible

As the architecture section explained, each column is split into DPs, and each DPN stores statistics for its DP. Importing data in order makes the contents of different DPs more clearly separated. For example, if data is imported in chronological order, then max(date) of one DP <= min(date) of the next; this cuts down the number of suspect DPs at query time and improves query performance. In other words, ordered import keeps the data inside each DP concentrated rather than scattered.
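
A hedged sketch of one way to do this, assuming the data first sits in an ordinary MySQL staging table (hypothetical names):

-- export sorted by time from the staging server...
SELECT * FROM staging_events
ORDER BY event_date
INTO OUTFILE '/tmp/events_sorted.csv'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- ...then bulk-load the sorted file into Infobright
LOAD DATA INFILE '/tmp/events_sorted.csv' INTO TABLE events
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';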

5) Use efficient query statements

There is a lot to cover here; in summary:

  • Avoid OR; use IN or UNION instead (see the sketch after this list).
  • Reduce I/O operations: Infobright's data is compressed, and decompression costs considerable time.
  • Prefer query conditions with high selectivity.
  • In the SELECT list, stick as closely as possible to the fields used in the WHERE clause. Infobright processes column by column, each column separately, so avoiding columns outside the WHERE clause gives better performance.
  • Limit the number of tables in the query, i.e. the number of tables joined in the SELECT statement.
  • Prefer independent (uncorrelated) subqueries and JOIN operations over correlated subqueries.
  • Avoid MySQL functions and type-conversion operators in WHERE clauses.
  • Avoid queries that must fall back to the MySQL optimizer.
  • Avoid queries that span Infobright tables and MySQL tables.
  • Avoid arithmetic inside GROUP BY or subqueries, such as sum(a*b).
  • Trim unnecessary fields from the SELECT list.
  • Avoid SELECT * FROM table.
  • Avoid UNION ALL.
  • Prefer the functions provided by the system.
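
To make the first guideline concrete, a hedged before/after sketch (hypothetical table):

-- instead of OR conditions:
--   SELECT dt, cnt FROM stats WHERE city = 'Beijing' OR city = 'Shanghai';
-- prefer IN, which lets the Knowledge Grid narrow down DPs more cleanly:
SELECT dt, cnt FROM stats WHERE city IN ('Beijing', 'Shanghai');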

When Infobright executes a query, most of the time goes into the optimization phase. The Infobright optimizer is already very powerful, but programmers still need to take care when writing query statements.

Infobright Import Tools

  • INSERT
  • MySQL loader (@bh_dataformat = 'mysql')
  • ETL tools: http://www.infobright.org/downloads/contributed?software/
  • Infobright loader: CSV format (@bh_dataformat = 'txt_variable'), binary format (@bh_dataformat = 'binary'); see the sketch after this list
  • DLP distributed import tool (1.6 TB/hour)
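
A short sketch of switching loaders with the session variable from the list above:

set @bh_dataformat = 'mysql';         -- MySQL-compatible loader
set @bh_dataformat = 'txt_variable';  -- Infobright loader, CSV input
set @bh_dataformat = 'binary';        -- Infobright loader, binary input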

Reference links:

  • Infobright commercial website: http://www.infobright.com/
  • Infobright community website: http://www.infobright.org/
  • MySQL's introduction to Infobright: http://dev.mysql.com/tech-resources/articles/datawarehousing_mysql_infobright.html
  • Infobright product demo video: http://www.infobright.com/Resource-Library/Webcasts-Podcasts?Infobright_product_demo
