Analysis of characteristics of column database

Last Update:2015-09-05 Source: Internet

Author: User

Tags sybase columnar database

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Transferred from: http://www.searchdatabase.com.cn/showcontent_61985.htm

The earliest commercially available database was the Sybase IQ released in 1995, but only slowly stabilized until around 1999 to be able to enter the production environment. Most of the analytic databases are now branching out from PostgreSQL in 2003-2005. In particular, the column database represented by Vertica has proved in the large-scale data warehouse environment that its special design for the Data Warehouse environment has a competitive advantage in some fields. This article explains some of the major features of the column-based database.

Efficient Storage-space utilization

Traditional row database because of the length of each column is different, in order to prevent the update without a row of data to jump to another block up, it is often reserved some space. While column-oriented databases exist for analysis at the outset, there is no need to consider a small number of update problems, so the data is fully stored.

Row database in order to indicate that the ID of the row will often have a pseudo-column ROWID exists. A column database typically does not save rowid.

Column database because of its different columns of data characteristics of the different algorithms to make it often has a higher than the row database compression ratio, the normal row database compression rate of about 3:1 to 5:1, and the column database compression rate is generally 8:1 to 30:1 or so. (Infobright in special applications can reach 40:1, Vertica in special applications can reach 60:1, generally so high compression rate is related to network traffic)

Because of its special IO model, the database execution engine typically does not need an index to complete a large number of data filtering tasks (except Sybase IQ). This in turn reduces the space consumption of data storage.

A materialized view is not required for a column database, and a row database typically has two materialized views in order to reduce IO, a non-polymeric view of common columns, and a materialized view of aggregations. The column database itself is a decentralized store, so there is no need for the first one, and because of the other features makes it ideal for ordinary aggregation operations. (another materialized view cannot be refreshed in real time, such as ranking functions, irregular connections to connect by, and so on, which are not included in the column database.) ）

Non-visible Index

A column database because each column of its data is sorted by selectivity, the index inside the row database is not required to reduce the distribution of IO and faster lookup values. As shown in: When the database execution engine makes a where condition filter. As long as it finds that any column of data does not meet certain criteria, the entire block of data is discarded. The final initial filter will only scan blocks that may meet the criteria.

(from Infobright:blazing Queries Using a Open Source columnar Database for high performance Analytics and Reporting)

In addition, after reading a possible block of data, for a similar age < 65 or job = ' Axx ', the column database does not need to scan the entire block, because the data is sorted. If you read the first age=66 or Job = ' Bxx ', you will stop scanning. This is quite similar to the range scan in the row database index.

Data iteration (Tuple iteration)

Now the multi-core CPU provides the L2 cache to better utilize the CPU's two-level cache and multi-core concurrency features in a short time to perform the same function many times. But the row database because its data mixed together cannot make the same simple function call to an array, so its execution efficiency is not high in the column database.

Compression algorithm

A column database is stored separately because each of its columns is separate. So it's easy to use different compression algorithms for each column's characteristics. Common column database compression algorithms include run Length Encoding, Data Dictionary, Delta Compression, BitMap Index, LZO, Null Compression, and more. The compression efficiency varies from 10w:1 to 10:1 depending on the characteristics. And the larger the data, the more obvious its compression efficiency increases.

Delay materialization

Because of its special execution engine, the column database usually does not need to extract the data in the middle process of the data, but instead of the operation with pointers, until the final output of the complete data is required.

(from Mcknight:columnar Database:data Does the Twist and Analytics shout)

Traditional row-type database operations, at the beginning of the operation to extract all the data, and then perform the subsequent filtering, projection, connection, aggregation operations

The execution plan for the column database is the same.

(from Mcknight:columnar Database:data Does the Twist and Analytics shout)

In the entire calculation process, regardless of the filtering, projection, connection, aggregation operation, the column database does not resolve the data until the final data is restored to the original data value. The benefits of doing this include reducing CPU consumption, reducing memory consumption, reducing network transmission consumption, and reducing the need for final storage.

The pros and cons of a column database

The column database comes from the data analysis of Data Warehouse in big Data Environment from the beginning, and it has some preconditions and advantages and disadvantages compared with the row database.

Column Database Benefits:

Very high loading speed (can be equal to the sum of all HDD io, basically the limit)
Suitable for large amounts of data rather than small data
Real-time loading data is limited to additions (delete and update need to decompress block then compute and then re-compress storage)
The efficient compression ratio not only saves storage space, but also saves computational memory and CPU.
Ideal for aggregation operations.

Disadvantages:

Not suitable for scanning small amounts of data
Not suitable for random updates
Batch update situation is different, some of the Better column database optimization (such as Vertica) performance is better, some not for the updated database performance is poor.
Not suitable for real-time operations that contain deletions and updates.

Common pitfalls

A common misconception is that if you scan more rows or full-scale full-table scans at a time, the row-database Billy database has the advantage. In fact, this is just a misunderstanding of the line database, that is, the main advantage of the column database is that its columns are stored separately, while ignoring the other features mentioned above in the column database, this is the core of the high performance of the column database.

Analysis of characteristics of column database

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More