Open-Source Columnar Storage Engines: Parquet and ORC

Source: Internet
Author: User

Reprinted from Dong's Blog

Compared with a traditional row-oriented storage engine, a columnar storage engine achieves a higher compression ratio and needs fewer I/O operations (note: columnar storage is not a cure-all, and row storage is still more efficient in many scenarios). It is especially cost-effective when a table has many columns but each operation touches only a few of them.
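The effect can be illustrated with a small sketch. It is not how Parquet or ORC lay out files: the table, column names, and value patterns are made up for illustration, and `json` plus `zlib` stand in for the real encodings. It only shows that reading one column touches far fewer bytes in a columnar layout, and that values of the same column tend to compress better when stored together.

```python
import json
import zlib

# Hypothetical table: 1000 rows, five columns, synthetic repeating values.
COLUMNS = ("user_id", "country", "device", "status", "latency_ms")
COUNTRIES = ["CN", "US", "IN", "BR", "DE", "JP", "FR"]
DEVICES = ["ios", "android", "web", "tv", "watch"]
STATUSES = ["ok", "error", "timeout"]

rows = [
    (i, COUNTRIES[i % len(COUNTRIES)], DEVICES[i % len(DEVICES)],
     STATUSES[i % len(STATUSES)], i % 50)
    for i in range(1000)
]

# Row layout: whole records are stored one after another.
row_store = json.dumps(rows).encode()

# Column layout: all values of one column are stored contiguously.
column_store = {
    name: json.dumps([row[idx] for row in rows]).encode()
    for idx, name in enumerate(COLUMNS)
}

def bytes_read_row_layout(wanted):
    # Fields are interleaved, so a scan reads every record in full,
    # no matter which columns the query wants.
    return len(row_store)

def bytes_read_column_layout(wanted):
    # Only the blocks of the requested columns are read.
    return sum(len(column_store[name]) for name in wanted)

query = ["latency_ms"]
row_bytes = bytes_read_row_layout(query)
col_bytes = bytes_read_column_layout(query)

# Compress each layout with the same general-purpose codec.
row_compressed = len(zlib.compress(row_store))
col_compressed = sum(len(zlib.compress(block)) for block in column_store.values())

print(f"bytes read for {query}: row={row_bytes}, column={col_bytes}")
print(f"compressed size: row={row_compressed}, column={col_compressed}")
```

The synthetic values are deliberately regular, so the gap shown here is larger than what real data would usually give; actual ratios depend on the data and on the encodings the engine applies.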

In Internet big data applications, data volumes are usually very large and records have many fields, yet each query reads only a few of them. Columnar storage is an excellent choice for this kind of workload. Among the current open-source implementations, the best-known columnar storage engines are Parquet and ORC. Both were promoted to Apache top-level projects within the past year, which shows how important they have become. This article attempts to compare the two storage engines.

Apache Parquet

Parquet originates from the Google Dremel system (the paper is available for download). Parquet corresponds to the data storage engine inside Google Dremel, while Apache Drill, an Apache top-level open-source project, is the open-source implementation of Dremel.

Apache Parquet was originally designed to store nested data, such as Protocol Buffers, Thrift, and JSON, in a columnar format so that it can be compressed and encoded efficiently and the required data can be extracted with fewer I/O operations. This is also Parquet's advantage over ORC: it can store Protobuf and Thrift data in columnar form transparently. With Protobuf and Thrift so widely used today, integrating them with Parquet is easy and natural. Beyond this advantage, Parquet has little else to recommend it over ORC. For example, it does not support update operations (data cannot be modified once written) and it does not support ACID.
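To make "storing nested data in a column format" more concrete, here is a simplified sketch that flattens nested records into one value list per leaf path. The records and the `shred` and `to_columns` helpers are hypothetical. The sketch deliberately leaves out the repetition and definition levels that Dremel and Parquet store next to each value, so unlike the real format it cannot reassemble the original records.

```python
def shred(record, prefix=""):
    """Yield (column_path, value) pairs for every leaf of a nested record."""
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from shred(value, path)
    elif isinstance(record, list):
        # Repeated field: every element goes to the same column path.
        for item in record:
            yield from shred(item, prefix)
    else:
        yield prefix, record

def to_columns(records):
    """Group leaf values by column path, one list per column."""
    columns = {}
    for record in records:
        for path, value in shred(record):
            columns.setdefault(path, []).append(value)
    return columns

# Hypothetical nested records, e.g. decoded from Thrift, Protobuf, or JSON.
records = [
    {"id": 1, "user": {"name": "a", "tags": ["x", "y"]}},
    {"id": 2, "user": {"name": "b", "tags": []}},
]

columns = to_columns(records)
for path, values in columns.items():
    print(path, values)
```

The `user.tags` column ends up holding `["x", "y"]` with no indication that both values belong to the first record and none to the second. That lost information is what the extra per-value levels in the real format record.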

Apache ORC

ORC (Optimized RC File) is derived from the RC (Record Columnar File) storage format. RC is a columnar storage engine with poor support for schema evolution (changing the schema requires regenerating the data). ORC improves on RC mainly in compression encoding and query performance, but its support for schema evolution is still poor. RC/ORC was first used in Hive, and it later gained enough momentum to be split out as a separate project. Hive 1.x's support for transactions and update operations is implemented on top of ORC (other storage formats are not supported for now). ORC has since acquired some very advanced features, such as support for update operations, ACID support, and support for complex types such as struct and array. You can use these complex types to build a nested data schema similar to Parquet's, but when the nesting is deep the schema becomes cumbersome to write, whereas the schema representation provided by Parquet expresses multi-level nested data types more easily.
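As a rough illustration of building a nested schema from struct and array types, the sketch below renders a nested Python description as a Hive/ORC-style type string. The `orc_type` helper and the example schema are hypothetical and are not part of any ORC or Hive API. Only the `struct<...>` and `array<...>` notation follows the syntax Hive uses for complex types.

```python
def orc_type(spec):
    """Render a nested Python description as a Hive/ORC-style type string."""
    if isinstance(spec, dict):
        # A dict stands for a struct with named fields.
        fields = ",".join(f"{name}:{orc_type(sub)}" for name, sub in spec.items())
        return f"struct<{fields}>"
    if isinstance(spec, list):
        # A one-element list stands for "array of that element type".
        return f"array<{orc_type(spec[0])}>"
    # Anything else is taken to be a primitive type name such as "string".
    return spec

# Hypothetical two-level log schema.
log_schema = {
    "id": "bigint",
    "user": {"name": "string", "tags": ["string"]},
}

type_string = orc_type(log_schema)
print(type_string)
```

Each additional nesting level adds another bracketed group inside the same string, which gives a feel for why deeply nested schemas become awkward to write this way.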

Comparison of Parquet and ORC


Summary

Columnar storage is gradually being adopted across product lines at Internet companies. Twitter, for example, has converted part of its data to Parquet, which reduced both storage space and query time by about one third (source: https://adtmag.com/articles/2015/04/28/apache-parquet.aspx). At Twitter, the log format is described with Thrift and stored in Parquet. One typical data format description has 87 fields in total with 7 levels of nesting.
