Reprinted from Dong's Blog
Compared with a traditional row-oriented storage engine, a columnar storage engine achieves a higher compression ratio and needs fewer IO operations (note: columnar storage is not a cure-all, and in many scenarios row storage is still more efficient). It is especially cost-effective when a table has many columns but each operation touches only a few of them.
In Internet big data applications, data volumes are usually very large and records have many fields, yet each query reads only a few of those fields. Columnar storage is an excellent choice in this situation. Among current open source implementations, the best-known columnar storage engines are Parquet and ORC. Both were promoted to Apache top-level projects within the last year, which shows how important they have become. This article attempts to compare the two.
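The IO argument can be made concrete with a toy cost model. The sketch below is not a real storage engine: the fixed 8-byte value size, the function names, and the table dimensions are all assumptions chosen for illustration. It only counts how many bytes a scan has to read under each layout when a query needs 2 of 50 columns.

```python
# Toy model: every value occupies VALUE_SIZE bytes (an assumption for illustration).
VALUE_SIZE = 8


def bytes_scanned_row_store(num_rows, num_columns):
    # A row store keeps all fields of a row together, so a full scan
    # reads every column of every row, whatever the query asked for.
    return num_rows * num_columns * VALUE_SIZE


def bytes_scanned_column_store(num_rows, columns_needed):
    # A column store keeps each column contiguous, so a scan reads
    # only the columns the query actually references.
    return num_rows * columns_needed * VALUE_SIZE


# Hypothetical table: 1,000,000 rows, 50 columns, query touches 2 columns.
row_bytes = bytes_scanned_row_store(1_000_000, 50)
col_bytes = bytes_scanned_column_store(1_000_000, 2)
print(row_bytes, col_bytes, row_bytes // col_bytes)
```

The model ignores compression, indexes, and row-level access patterns, which is exactly where row storage can still come out ahead.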
Apache Parquet
Parquet originates from Google's Dremel system (the paper is available for download). Parquet corresponds to the data storage engine inside Dremel, while Drill, another Apache top-level project, is an open source implementation of Dremel itself.
Apache Parquet was originally designed to store nested data such as Protocol Buffers, Thrift, and JSON. It stores such data in a columnar layout so that it can be compressed and encoded efficiently and so that the required data can be extracted with fewer IO operations. This is also Parquet's advantage over ORC: it can store Protobuf and Thrift data in columnar form transparently. Since Protobuf and Thrift are widely used today, integrating them with Parquet is easy and natural. Beyond this advantage, Parquet has little to offer over ORC. For example, it does not support update operations (data cannot be modified once it has been written), and it does not support ACID.
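As a rough illustration of what storing nested data in a column format means, the sketch below flattens nested records into one value list per leaf field. It is a simplified stand-in, not Parquet's actual encoding: the Dremel paper's scheme also stores repetition and definition levels so that records can be reassembled, and those are omitted here. The function names and sample records are hypothetical.

```python
from collections import defaultdict


def shred(record, prefix=""):
    """Yield (column_path, value) pairs for every leaf of a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested struct: descend and extend the dotted path.
            yield from shred(value, path)
        elif isinstance(value, list):
            # Repeated field: every element lands in the same column.
            for item in value:
                if isinstance(item, dict):
                    yield from shred(item, path)
                else:
                    yield path, item
        else:
            yield path, value


def to_columns(records):
    """Group leaf values by column path across all records."""
    columns = defaultdict(list)
    for record in records:
        for path, value in shred(record):
            columns[path].append(value)
    return dict(columns)


records = [
    {"user": {"id": 1, "name": "a"}, "tags": ["x", "y"]},
    {"user": {"id": 2, "name": "b"}, "tags": []},
]
columns = to_columns(records)
print(columns)
```

Note that the `tags` column no longer records which record each value came from, or that the second record had no tags at all. Recovering that information is the job of the repetition and definition levels left out of this sketch.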
Apache ORC
ORC (Optimized RC File) is derived from the RC (Record Columnar File) storage format. RC is a columnar storage engine with poor support for schema evolution (modifying the schema requires regenerating the data). ORC improves on RC mainly in compression encoding and query performance, but its support for schema evolution is still poor. RC and ORC were first used in Hive, and ORC later gained enough momentum to be split out as a separate project. Hive 1.x support for transactions and update operations is built on ORC (other storage formats are not supported for now). ORC has since gained some very advanced features, such as support for update operations, ACID support, and complex types such as struct and array. You can use these complex types to build a nested data schema similar to Parquet's, but when the nesting is deep the schema becomes cumbersome and complex to write, whereas the schema representation provided by Parquet makes multilevel nested data types easier to express.
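To give a feel for how nested schemas built from struct and array types grow, the sketch below renders a nested Python description as a Hive-style type string (`struct<...>`, `array<...>`). The helper `hive_type` and the sample `log_entry` schema are hypothetical and exist only for illustration. They are not part of Hive or ORC.

```python
def hive_type(spec):
    """Render a nested description as a Hive-style type string.

    str  -> primitive type name, used as-is
    list -> array of its single element type
    dict -> struct with one field per key
    """
    if isinstance(spec, str):
        return spec
    if isinstance(spec, list):
        return f"array<{hive_type(spec[0])}>"
    if isinstance(spec, dict):
        fields = ",".join(f"{name}:{hive_type(t)}" for name, t in spec.items())
        return f"struct<{fields}>"
    raise TypeError(f"unsupported schema element: {spec!r}")


# Hypothetical log schema with three levels of nesting.
log_entry = {
    "user": {"id": "bigint", "name": "string"},
    "events": [{"ts": "timestamp", "tags": ["string"]}],
}
type_string = hive_type(log_entry)
print(type_string)
```

Even at three levels the single-line type string is hard to read, which hints at why deeper nesting becomes awkward to write by hand.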
Comparison of Parquet and ORC
Summary
Columnar storage is gradually being adopted across product lines in the Internet industry. For example, Twitter has converted part of its data to Parquet, reducing storage space and query time by about one third (source: https://adtmag.com/articles/2015/04/28/apache-parquet.aspx). At Twitter, log formats are described with Thrift and stored in Parquet. A typical data format description there has 87 fields and 7 levels of nesting.
Open-source columnar storage engines: Parquet and ORC