Open-source Columnstore engine parquet and ORC

Last Update:2015-09-05 Source: Internet

Author: User


Reprinted from Dong's Blog

Compared with a traditional row-oriented storage engine, a column-store engine achieves a higher compression ratio and needs fewer I/O operations (note: column storage is not a cure-all; in many scenarios row storage is still more efficient). It is especially cost-effective when a table has many columns but each operation touches only a few of them.
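The I/O argument can be illustrated with a toy in-memory sketch. The table shape (1,000 rows, 20 integer columns), the fixed-width encoding, and the function names are assumptions made for illustration only; real engines such as Parquet and ORC add encodings, compression, and metadata on top of this idea.

```python
import struct

# Hypothetical table: 1,000 rows with 20 fixed-width integer columns.
NUM_ROWS = 1000
NUM_COLS = 20
rows = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]


def encode(values):
    """Pack integers as 8-byte little-endian values."""
    return struct.pack(f"<{len(values)}q", *values)


# Row layout: one blob per row, all columns of that row stored together.
row_store = [encode(row) for row in rows]

# Column layout: one blob per column, all values of that column stored together.
column_store = [encode([row[c] for row in rows]) for c in range(NUM_COLS)]


def scan_row_store(wanted):
    """Sum the wanted columns; every row blob has to be read in full."""
    bytes_read = 0
    totals = {c: 0 for c in wanted}
    for blob in row_store:
        bytes_read += len(blob)
        values = struct.unpack(f"<{NUM_COLS}q", blob)
        for c in wanted:
            totals[c] += values[c]
    return totals, bytes_read


def scan_column_store(wanted):
    """Sum the wanted columns; only the blobs of those columns are read."""
    bytes_read = 0
    totals = {}
    for c in wanted:
        blob = column_store[c]
        bytes_read += len(blob)
        totals[c] = sum(struct.unpack(f"<{NUM_ROWS}q", blob))
    return totals, bytes_read


wanted = [0, 7]  # the query touches 2 of the 20 columns
row_totals, row_bytes = scan_row_store(wanted)
col_totals, col_bytes = scan_column_store(wanted)
print(row_bytes, col_bytes)  # 160000 16000
```

Both scans return the same sums, but the column layout reads one tenth of the bytes here because only 2 of the 20 columns are touched. A query that needs every column of a row would see no such benefit, which is the caveat noted above.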

In Internet big data applications, data volumes are usually very large and records have many fields, yet each query reads only a few of them. Column storage is an excellent choice in this situation. Among open-source implementations, the best-known column-store engines are Parquet and ORC; both were promoted to Apache top-level projects within the past year, which shows their importance. This article attempts to compare the two.

Apache Parquet

Parquet originates from Google's Dremel system (the paper is available for download). Parquet corresponds to the data storage engine in Dremel, while Apache Drill, another Apache top-level project, is an open-source implementation of Dremel itself.

Apache Parquet was originally designed to store nested data, such as Protocol Buffers, Thrift, and JSON, in a columnar format, so that the data can be compressed and encoded efficiently and the required values can be extracted with fewer I/O operations. This is also Parquet's advantage over ORC: it can transparently store Protobuf and Thrift data in columnar form. Given how widely Protobuf and Thrift are used today, integrating them with Parquet is easy and natural. Beyond this, Parquet has few advantages over ORC; for example, it does not support update operations (data cannot be modified once it is written), does not support ACID, and so on.
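As a rough illustration of storing nested data by column, the sketch below shreds nested records into one value list per leaf path. The `shred` function and the sample records are hypothetical. The sketch handles only nested dictionaries and fills missing leaves with `None`; Parquet itself follows the Dremel approach and records repetition and definition levels so that repeated and optional fields can be reassembled exactly, which is omitted here.

```python
def shred(records):
    """Turn nested dict records into {dotted_path: [one value per record]}."""

    def leaves(node, prefix=""):
        # Walk a nested dict and yield (dotted_path, leaf_value) pairs.
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                yield from leaves(value, path)
            else:
                yield path, value

    flat = [dict(leaves(record)) for record in records]
    paths = sorted({path for row in flat for path in row})
    # A leaf that is absent from a record is stored as None in its column.
    return {path: [row.get(path) for row in flat] for path in paths}


# Hypothetical log records with an optional nested field.
records = [
    {"user": {"id": 1, "geo": {"country": "US"}}, "url": "/a"},
    {"user": {"id": 2, "geo": {"country": "CN"}}, "url": "/b"},
    {"user": {"id": 3}, "url": "/c"},
]

columns = shred(records)
for path, values in columns.items():
    print(path, values)
```

Each leaf path becomes its own column, so a query that needs only `user.id` can skip `url` and `user.geo.country` entirely.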

Apache ORC

ORC (Optimized RC File) is derived from the RC (Record Columnar File) storage format. RC is a column-store format with poor support for schema evolution (modifying the schema requires regenerating the data). ORC improves on RC, mainly in compression, encoding, and query performance, but its support for schema evolution is still poor. RC/ORC was first used in Hive, gained good momentum, and eventually became an independent project. Hive 1.x's support for transactions and update operations is implemented on top of ORC (other storage formats are not yet supported). ORC has since developed some very advanced features, such as support for update operations, ACID, and complex types such as struct and array. You can use these complex types to build nested data schemas similar to Parquet's, but when the nesting is deep this becomes cumbersome and complex to write, whereas the schema representation provided by Parquet makes multilevel nested data types easier to express.
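To give a feel for how nested schemas look when built from struct and array types, the sketch below renders a nested schema as a Hive-style type string. The `to_type_string` helper, the dict/list notation it accepts, and the sample field names are all assumptions made for illustration; they are not part of the ORC or Hive APIs.

```python
def to_type_string(schema):
    """Render a nested schema as a Hive-style type string.

    Notation assumed for this sketch:
      "bigint"        -> a primitive type name, used as-is
      [element]       -> array<element>
      {name: type}    -> struct<name:type,...>
    """
    if isinstance(schema, str):
        return schema
    if isinstance(schema, list):
        (element,) = schema  # exactly one element type is expected
        return f"array<{to_type_string(element)}>"
    if isinstance(schema, dict):
        fields = ",".join(
            f"{name}:{to_type_string(field)}" for name, field in schema.items()
        )
        return f"struct<{fields}>"
    raise TypeError(f"unsupported schema node: {schema!r}")


# Hypothetical log schema with three levels of nesting.
schema = {
    "id": "bigint",
    "user": {"name": "string", "tags": ["string"]},
    "events": [{"ts": "timestamp", "props": {"key": "string"}}],
}

type_string = to_type_string(schema)
print(type_string)
```

Even at three levels the single-line type string is hard to read, which gives some sense of why deeply nested schemas become cumbersome to write this way.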

Comparison of Parquet and ORC

Summary

Column storage is gradually being adopted across product lines at Internet companies. For example, Twitter has converted part of its data to Parquet, reducing storage space and query time by about 1/3 (source: https://adtmag.com/articles/2015/04/28/apache-parquet.aspx). At Twitter, the log format is described in Thrift and stored in Parquet; a typical log schema has 87 fields and 7 levels of nesting.


This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
