Analysis of characteristics of column database

Source: Internet
Author: User
Tags sybase columnar database

Transferred from: http://www.searchdatabase.com.cn/showcontent_61985.htm

The earliest commercially available database was the Sybase IQ released in 1995, but only slowly stabilized until around 1999 to be able to enter the production environment. Most of the analytic databases are now branching out from PostgreSQL in 2003-2005. In particular, the column database represented by Vertica has proved in the large-scale data warehouse environment that its special design for the Data Warehouse environment has a competitive advantage in some fields. This article explains some of the major features of the column-based database.

Efficient Storage-space utilization

Traditional row database because of the length of each column is different, in order to prevent the update without a row of data to jump to another block up, it is often reserved some space. While column-oriented databases exist for analysis at the outset, there is no need to consider a small number of update problems, so the data is fully stored.

Row database in order to indicate that the ID of the row will often have a pseudo-column ROWID exists. A column database typically does not save rowid.

Column database because of its different columns of data characteristics of the different algorithms to make it often has a higher than the row database compression ratio, the normal row database compression rate of about 3:1 to 5:1, and the column database compression rate is generally 8:1 to 30:1 or so. (Infobright in special applications can reach 40:1, Vertica in special applications can reach 60:1, generally so high compression rate is related to network traffic)

Because of its special IO model, the database execution engine typically does not need an index to complete a large number of data filtering tasks (except Sybase IQ). This in turn reduces the space consumption of data storage.

A materialized view is not required for a column database, and a row database typically has two materialized views in order to reduce IO, a non-polymeric view of common columns, and a materialized view of aggregations. The column database itself is a decentralized store, so there is no need for the first one, and because of the other features makes it ideal for ordinary aggregation operations. (another materialized view cannot be refreshed in real time, such as ranking functions, irregular connections to connect by, and so on, which are not included in the column database.) )

Non-visible Index

A column database because each column of its data is sorted by selectivity, the index inside the row database is not required to reduce the distribution of IO and faster lookup values. As shown in: When the database execution engine makes a where condition filter. As long as it finds that any column of data does not meet certain criteria, the entire block of data is discarded. The final initial filter will only scan blocks that may meet the criteria.

(from Infobright:blazing Queries Using a Open Source columnar Database for high performance Analytics and Reporting)

In addition, after reading a possible block of data, for a similar age < 65 or job = ' Axx ', the column database does not need to scan the entire block, because the data is sorted. If you read the first age=66 or Job = ' Bxx ', you will stop scanning. This is quite similar to the range scan in the row database index.

Data iteration (Tuple iteration)

Now the multi-core CPU provides the L2 cache to better utilize the CPU's two-level cache and multi-core concurrency features in a short time to perform the same function many times. But the row database because its data mixed together cannot make the same simple function call to an array, so its execution efficiency is not high in the column database.

Compression algorithm

A column database is stored separately because each of its columns is separate. So it's easy to use different compression algorithms for each column's characteristics. Common column database compression algorithms include run Length Encoding, Data Dictionary, Delta Compression, BitMap Index, LZO, Null Compression, and more. The compression efficiency varies from 10w:1 to 10:1 depending on the characteristics. And the larger the data, the more obvious its compression efficiency increases.

Delay materialization

Because of its special execution engine, the column database usually does not need to extract the data in the middle process of the data, but instead of the operation with pointers, until the final output of the complete data is required.

(from Mcknight:columnar Database:data Does the Twist and Analytics shout)

Traditional row-type database operations, at the beginning of the operation to extract all the data, and then perform the subsequent filtering, projection, connection, aggregation operations

The execution plan for the column database is the same.

(from Mcknight:columnar Database:data Does the Twist and Analytics shout)

In the entire calculation process, regardless of the filtering, projection, connection, aggregation operation, the column database does not resolve the data until the final data is restored to the original data value. The benefits of doing this include reducing CPU consumption, reducing memory consumption, reducing network transmission consumption, and reducing the need for final storage.

The pros and cons of a column database

The column database comes from the data analysis of Data Warehouse in big Data Environment from the beginning, and it has some preconditions and advantages and disadvantages compared with the row database.

Column Database Benefits:

    • Very high loading speed (can be equal to the sum of all HDD io, basically the limit)
    • Suitable for large amounts of data rather than small data
    • Real-time loading data is limited to additions (delete and update need to decompress block then compute and then re-compress storage)
    • The efficient compression ratio not only saves storage space, but also saves computational memory and CPU.
    • Ideal for aggregation operations.

Disadvantages:

    • Not suitable for scanning small amounts of data
    • Not suitable for random updates
    • Batch update situation is different, some of the Better column database optimization (such as Vertica) performance is better, some not for the updated database performance is poor.
    • Not suitable for real-time operations that contain deletions and updates.

Common pitfalls

A common misconception is that if you scan more rows or full-scale full-table scans at a time, the row-database Billy database has the advantage. In fact, this is just a misunderstanding of the line database, that is, the main advantage of the column database is that its columns are stored separately, while ignoring the other features mentioned above in the column database, this is the core of the high performance of the column database.

Analysis of characteristics of column database

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.