MySQL Data Warehouse Solution infobright detailed

Source: Internet
Author: User
Tags numeric mysql functions mysql query pack mysql import

Infobright is a column-type database based on a unique proprietary knowledge grid technology. Infobright is open source MySQL Data Warehouse solution, introduced a column storage solution, high-strength data compression, optimized statistical calculation (similar to Sum/avg/group by), Infobright is based on MySQL, but do not install MySQL can also, Because it itself brought one. MySQL can be roughly divided into logical layer and physical storage engine, Infobright is mainly to implement a storage engine, but because its own storage logic and relational database is fundamentally different, so it can not be as InnoDB directly as a plug-in hook to the MySQL, Its logical layer is the logical layer of MySQL plus its own optimizer.

Infobright Features

Advantages:

Large data query performance is strong, stable: million, tens of millions, million-level record number of conditions, the same select query statements, faster than MyISAM, InnoDB and other common MySQL storage engine fast 5~60 times. Efficient query relies primarily on the optimization of the query by a specially designed storage structure, but the effect of the optimization is also dependent on the database structure and the design of the query statement.

Large amount of storage data: Terabytes of data size, billions of records. Data volume storage relies primarily on its own high-speed data-loading tool (g/hours) and high data compression ratios (>10:1)

High data compression ratio: The average can reach more than 10:1 of the data compression rate. Can even reach 40:1, greatly saving the data storage space. The high data compression ratio mainly relies on the flexible compression algorithm of the column type storage and the patent-pending.

Column-based storage: No indexing, no partitions. Even if the volume of data is very large, the query speed is very fast. For data warehouses, processing large amounts of data does not have a set of not. Without indexing, it avoids the problem of maintaining indexes and indexing as data expands. Each column of data is compressed and stored, each piece has knowledge grid node records in the block of statistics, in place of the index, accelerate the search.

Quick response to complex aggregate class queries: For complex analytical SQL queries such as SUM, COUNT, AVG, GROUP by

The value of Infobright

Save design overhead. No complex Data Warehouse model design requirements (such as star model, snowflake model), no need materialized views, data partitioning, index building

Conserve storage resources. High compression ratios are usually 10:1, and some applications can reach 40:1

Integrated utilization is extensive. Compatible with numerous BI kits, such as Pentaho, Cognos, jaspersof

Reduce the cost of operation and maintenance. As the database grows, query and load performance remains stable, implementation and management is simple and requires very little management

Business assurance. The first commercially supported open source warehousing Analysis database, the ORACLE/MYSQL official recommended warehousing integration architecture

The application scene of Infobright

Analysis and application of large amount of data. Web/Online analytics, mobile analytics, customer behavior analysis, analytics marketing and advertising

Log/Event Management system. Telecommunications detailed list analysis and reporting, System/network security certification Records

Data mart. Enterprises and institutions of specific data warehouses, to provide data warehouse for small and medium-sized enterprises

Embedded analysis. Provides embedded analysis applications for independent software vendors/SaaS vendors

Limit:

No support for Data updates: Community Edition Infobright can only import data using the "LOAD data INFILE" method, not insert, UPDATE, DELETE. This makes it difficult to modify the data, which limits its use as a data warehouse for real-time data services.

High concurrency is not supported: only more than 10 concurrent queries can be supported, although more than 10 concurrent libraries are sufficient for general applications, but low machine utilization is always a bad thing for investors, especially in the case of more concurrent small requests.

does not provide master-slave backup and scale-out functionality. If there is no master-slave backup, want to do backup, you can also load data from the same time, but only to verify the final data consistency, so that when the data load from the machine stop service time is longer; horizontally scaling, it itself is not a distributed storage system.

Compare with MySQL

Infobright is suitable for data Warehouse situations: Non-transaction, non-real-time, non-concurrency, analysis-oriented, storage of established facts, such as logs, or aggregated large amounts of data. So it's not suitable for requests from web site users. It actually takes a record a lot slower than MySQL, but it takes a 100W record faster than MySQL.

MySQL's total data file occupies more space than the actual data because it has indexes. Infobright has a powerful compression capability, compressed by columns of different types of data.

Service forms and interfaces are consistent with MySQL, you can use a MySQL-like way to enable the Infobright service, and then the original connection to the MySQL application can be a similar way to connect and query Infobright. This is a boon to skilled MySQL, the cost of learning is basically 0.

Infobright has two release versions: Open source ice and closed-source commercial IEE. Ice provides sufficient functionality, but cannot be insert,delete,update and can only load DATA INFILE. IEE In addition to providing more full functionality, it is said that the query speed is also faster.

Community Ice version, the major domestic enterprises have tested, input production system is less, mainly for the following reasons:

Restrictions on DML, alter statements

Scheduled incremental load Export Import

MyISAM is difficult to support high concurrency, if you want to make full use of server resources, you need to open another MySQL instance

Poor support for multibyte text such as Chinese

Only single core scheduling is supported

Lack of OEM support

Ice differs from IEE version

IEE includes additional features for most enterprise job requirements, such as better query performance, DML statement support, distributed import, and more. In addition, the IEE version also contains a certain level of infobright original or agents of the support of rescue services, product training.

Significant differences in query performance. Although IEE and ice versions have significantly exceeded query performance for example, Oracle, SQL Server, MySQL, and other row-style databases, IEE is 50-500% faster than the ice version. The obvious difference comes from the Multithreaded scheduling module (introduced from IEE3.5), which is unique to the IEE core engine. In ice, a separate query can only use a single CPU core, and other query processes can only use other cores. For complex queries that need to filter and distinguish large amounts of data, using IEE multithreaded scheduling module can significantly save query time.

Supports DML statements. IEE supports standard SQL data manipulation languages and uses INSERT, UPDATE, and delete to manipulate data. And ice only supports the load data infile to import, any data changes need to re-import all the data. The use of DML statements can degrade data query performance and increase with times.

Supports DDL statements. Includes ALTER TABLE Rename,add column,drop column (but column operations can only take effect on the last column)

Support for Hadoop interface (via DLP)

Advanced replication and high availability. The IEE version contains master-slave functionality, based on SQL statement

Simpler imports and faster import speeds. IEE supports the Distributed Import Tool-DLP and contains standard MySQL native loader for processing some complex data imports, on the other hand, it also shows that ibloader fault tolerance is poor

Consistent query for load or DML at the same time

Support temporary tables

Other business authorization, after-sale support, etc.

Architecture

Based on the internal architecture of MySQL –infobright take the internal architecture similar to MySQL, the following is the Infobright architecture diagram:

Infobright

The gray part is the original MySQL module, the white and blue part is infobright itself.

Infobright is a two-tier structure like MySQL:

Logic layer: Processing query logic (service and application management), the loader and unloader on the right side of the logical layer are the Infobright data import and Export module, which is also the process of SQL statement load Data INFILE ... With select ... Into the file task, because Infobright is facing the mass data environment, so this data import Export module is an independent service, not directly using MySQL module. The logical layer of the Infobright optimizer is wrapped outside of the MySQL query optimizer, as will be mentioned below, because its storage layer has some special structure, so the query optimization method is also very different from MySQL.

Storage Engine: The default storage engine for Infobright is BrightHouse, but Infobright can also support other storage engines, such as MyISAM, Mrg_myisam, Memory, CSV. Infobright organizes data through three tiers, namely DP (Data Pack), DPN (data Pack Node), KN (Knowledge node). On top of these three floors is a powerful knowledge network (Knowledge Grid).

Infobright's Module

Optimizer Optimizer. Minimize the uncompressed data and effectively improve the execution plan.

The Knowledge grid knowledge grid. Store meta data, column information, table relationships, data block distribution status statistics, equivalent query state caching information

Data Pack block. Real data compression storage location, save according to data storage block

Data Pack (block) compression layer

At the bottom of the storage engine is a data Pack (block). Each pack contains a column of 64K elements, all data in this form packaged storage, each block of data type-related compression (that is, according to different data types using different compression algorithm), compression ratio is very high. The upper layer of the compressor and the decompression device to do this thing.

Infobright claims that the data compression ratio is 10:1 to 40:1. We've already said that. Infobright compression is based on the data type of DP, the system automatically chooses the compression algorithm, and adaptively adjusts the parameters of the algorithm to achieve the optimal compression ratio. First look at the compression ratio in the experimental environment, as shown in the following figure:

The overall compression ratio is 20.302. But there is a misconception here that the compression ratio refers to the size of the original data in the database/compressed data size, not the physical data size of the text file/compressed data size. Obviously the former will be a lot bigger than the latter. In my experimental environment, the latter is about 7:1. Generally, text data is stored in a database larger than the original text, because some of the fields are set to a fixed length, taking up more space than is actually the case. There is a lot of statistical data in the database, including the index, these statistics data occupy the space is absolutely not small. Infobright is not indexed, but it has KN data, which typically accounts for about 1% of the total size of the data.

Now that Infobright will compress according to the specific data type, let's look at the compression ratios of the different data types. As shown in the following table:

First look at the compression ratio of type int, the result is the compression ratio on int

To see the Act field, the Act field uses comment lookup, which has a better compression ratio and query performance than a simple char type. The principle of comment lookup is actually more like a bitmap index. The next section on the use of comment lookup will be carefully described. In all fields the Date field has the highest compression ratio and the final data size is only 0.1M. VarChar compression ratio is poor, so unless necessary, otherwise do not recommend the use of varchar.

The above data clearly shows Infobright's powerful compression performance. Again, the data compression is not only related to the data type, but the degree of the difference of the data plays a particularly important role. When selecting field data types, individuals feel that performance considerations should be the first. For example, the selection of some fields in the table above can be optimized, IP can be changed to bigint type, date can be split into year/month/day three columns as needed.

Knowledge grid (Knowledge grid)

The compression layer is infobright the most important concept: knowledge grid (knowledge grid) This is also infobright discard the index can be applied to a large number of data query basis. Knowledge Grid architecture is an important reason for infobright performance. It contains two types of nodes:

Data Pack node (block nodes): the one by one corresponding relationship is the data Pack node and the date pack. DPN records Some of the statistics that are stored and compressed in each DP, including the maximum (max), Minimum (min), number of NULL, number of units count, sum. AVG, and so on. To the amount of different values and so on; Knowledge node stores some of the more advanced statistics, as well as the connection information with other tables, some of the information in the data load is already good, some of the query is calculated, so there is a certain "smart".

Knowledge node stores a collection of metadata that points to a DP or relationship between columns, such as the range of values that occur (Miin_max), and the association between column data. Most of the KN data is generated when the data is loaded, while others are generated when querying.

Knowledge grid can be divided into four parts, DPN, histogram, CMAP, p-2-p.

DPN as described above.

Histogram is used to improve the performance of queries for numeric types, such as Date,time,decimal. The histogram is generated when the data is loaded. DPN in the mix, max,histogram in the Min-max into 1024 paragraphs, if the Mix_max range is less than 1024, each paragraph is a separate value. At this point kn is the binary representation of whether a number is in the current segment.

Histogram's role is to quickly determine whether the current DP satisfies the query criteria. As shown in the figure above, such as the Select ID from CustomerInfo where id>50 and id<70. Then it is easy to get the current DP does not meet the condition. So histogram can effectively reduce the number of query DP for that kind of digitally qualified query.

CMAP is a query for text types and is generated when data is loaded. CMap is the statistic of the current DP, where ASCII occurs in 1-64 locations. As shown in the following figure

For example, the above illustration shows that a is never seen in the second, third, fourth position of the text. 0 indicates no appearance, and 1 indicates that it has occurred. The comparison of the text in the query is based on the byte, so the performance of the text query can be improved well according to CMAP.

Pack-to-pack is generated when a join operation is a bitmap that represents the relationship between two columns of the operation in the two DP of a join, which is the matrix represented by the binary.

stored in memory, scoped in a sission

Improve join query performance, whether it's new or reused

Rough Set (Rough Sets) is one of the core technologies of Infobright. Infobright divides the DP into three categories according to the Knowledge Network (knowledge Grid) when executing the query:

Related DP (relevant packs), a DP that satisfies the restriction of query conditions

Unrelated DP (irrelevant packs), a DP that does not meet query conditions

Suspicious DP (suspect packs), the data inside the DP satisfies the restriction of the query condition.

Case:

SELECT COUNT (*) from Employees WHERE salary > 100000 and, < job = ' IT ' and city = ' San Mateo ';

Find packets that contain salary > 100000

Find packets that contain age < 35

Find a packet that contains job = ' IT '

Find packets containing city = ' San Mateo '

Remove all tags that are irrelevant to the search criteria

Finally, the relevant data is extracted in the determined data packet.

Perform a search

As you can see from the analysis above, Infobright can perform some queries efficiently, and the higher the distinction of the where statement is, the better. Where the high degree of discrimination can be more precise to determine whether the relevant DP or is not related to DP or DP, minimize the number of DP, reduce the performance loss of decompression. In the use of conditional judgment, the histogram and cmap mentioned in the previous chapter are generally used, which can effectively improve query performance. The principle of multiple table joins is similar. First, the relationship between the DP of the two columns that generated the join is exploited by pack-to-pack. For example: SELECT MAX (X.D) from T JOIN X on t.b = x.c WHERE t.a > 6. Pack-to-pack produces the relationship matrix M between the T.B and the x.c DP. Assuming that there is an element crossover between the first DP of the T.B and the first DP of X.C, then m[1,1]=1, otherwise m[1,1]=0. This effectively reduces the number of DP in the join operation. The front drops down to understand the compression, incidentally, by mentioning the DP compression. The 64K elements in each DP are treated as a sequence in which all NULL locations are stored separately, and the rest of the non-null data is compressed. Data compression is related to the type of data, and Infobright chooses the compression algorithm based on the type of data. Infobright adjusts the parameters of the algorithm adaptively to achieve the optimal compression ratio.

Knowledge Grid is still more complex, there are many details, you can refer to the official white paper and Brighthouse:an analytic Data Warehouse for AD-HOC queries this paper.

Use of comment lookup

The structure of Infobright is analyzed, and the compression process and working principle of infobright are briefly introduced. Now let's discuss query optimization issues.

1 configuration environment: Under Linux, the configuration of Infobright environment can be configured according to the requirements of the Readme, Brighthouse.ini files.

2 Select an efficient data type

The infobright inside supports all MySQL's original data types. Where the integer type is more efficient than other data types. Use the following data types whenever possible:

Tinyint,smallint,mediumint,int,bigint

Decimal (Minimize the number of decimal places)

DATE, Time

Low efficiency, deprecated data types are:

BINARY VARBINARY

FLOAT

DOUBLE

VARCHAR

Tinytext TEXT

Some experiences and points of attention for the use of Infobright data types:

The range of numeric types for infobright is somewhat different from MySQL, such as the minimum value of Infobright int is-2147483647, while MySQL's int minimum value should be-2147483648. This is the problem with other numeric types.

The ability to use small data types to use small data types, such as being able to use smallint does not apply to int, is infobright consistent with MySQL.

Avoid inefficient data types, such as text can not be used, such as float as far as possible to replace the decimal, but it is necessary to weigh the decimal loss of precision.

As little as possible with varchar, in the MySQL dynamic varchar performance is not strong, so try to avoid varchar. If appropriate, you can choose to convert varchar to char storage or even a special integer type. The advantage of varchar is that the length of the allocated space is variable, since Infobright has such a good compression performance, the individual thinks that the varchar can be converted to char. Char will have better query and compression performance.

You can use int as much as possible, and sometimes you can even convert some char-type data to an integer. For example, the customer permanent ID, customer ID, and so on in the search log can be stored in bigint without char storage. In fact, the time to split into year, month, day three-column storage is also a good choice. In the system I can see the time is basically the most frequently used fields, improve the Time field query performance is obviously very important. Of course, this is still to be based on the specific circumstances of the system, do data analysis when there is a need for MySQL time functions.

VarChar and Char fields can also use comment lookup,comment lookup to significantly increase compression ratios and query performance.

3) Using Comment Lookup

Comment Lookup can only be used explicitly on char or varchar. Comment lookup reduces storage space, increases compression, and uses Comment lookup for char and varchar fields to improve query efficiency. The Comment Lookup implementation mechanism is much like a bitmap index, which achieves better query performance and compression ratios by using short numeric types instead of char fields. The use of Comment lookup in addition to the data type requirements, there are certain requirements for the data. General requirements The total number of data categories is less than 10000 and the number of units/categories in the current column is greater than 10. Comment Lookup is more suitable for the age, sex, and province of this type of field.

Comment Lookup is simple to use when you create a database table by defining the following:

Act char comment ' lookup ',

Part char (4) Comment ' lookup ',

4) as much as possible in order to import data

The previous analysis of the Infobright architecture, each column is divided into n dp, each DPN column surface stores some of the DP statistics. The orderly import of data can make the data in different DP DPN more obvious. For example, to import data in chronological order, the previous DP's MAX (date) <= the next DP's min (date), which reduces the suspect DP and improves query performance. In other words, importing data in an orderly manner makes the DP internal data more concentrated and less fragmented.

5 Use efficient query statements

There are a lot of things involved here, summarized as follows:

Try not to apply or, you can use in or union instead

Reduce IO operations because the data inside the infobright is compressed and the decompression process consumes a lot of time.

As far as possible when the query to select the more obvious statement of the difference

In the Select, use the fields that appear in the where as much as possible. The reason is that Infobright is processed by column, and each column is handled separately. So avoid using fields that don't appear in the Where to get better performance.

The number of tables that are restricted to the results, that is, the number of tables that appear in the Select.

Try to use independent subqueries and join operations to replace dependent subqueries

Try not to use MySQL functions and type conversions in the Where

Try to avoid query operations that use the MySQL optimizer

Using query operations that span infobright tables and MySQL tables

Try not to use mathematical operations, such as SUM (A*B), in group by or subqueries.

Select to remove the field as much as possible.

Avoid using SELECT * FROM table

Avoid using UNION ALL

Use System-supplied functions as much as possible

Infobright Execute query statements, most of the time is spent in the optimization phase. The Infobright optimizer is already very powerful, but many of the details of writing query statements need to be noticed by programmers.

Infobright Import Tool

Insert

MySQL Import tool (@bh_dataformat = ' mysql ')

ETL Tools: http://www.infobright.org/Downloads/Contributed?Software/

Infobright its own importer: CSV format (@bh_dataformat = ' txt_variable '), binary format (@bh_dataformat = ' binary ')

DLP Distributed Import Tool (1.6tb/hours)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.