Brief introduction
Apache Parquet is a columnar storage format designed primarily for the Hadoop ecosystem, independent of any particular data processing framework, data model, or programming language. Cloudera's online analytical processing (OLAP) project Impala uses Parquet as its column store.
Parquet began as an internal columnar storage project at Twitter. It is now open source, with the format specification hosted in the parquet-format repository on GitHub.
Parquet is a columnar storage format for use with Hadoop. It provides column-based data representations with highly efficient compression to every project in the Hadoop ecosystem, regardless of data processing framework, data model, or programming language.
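To make the columnar idea concrete, here is a minimal sketch in plain Python (not the Parquet API; the field names are made up for illustration) contrasting a row-oriented layout with a column-oriented layout of the same records:

```python
# Three example records with hypothetical fields.
rows = [
    {"user": "alice", "country": "US", "clicks": 3},
    {"user": "bob",   "country": "US", "clicks": 5},
    {"user": "carol", "country": "DE", "clicks": 3},
]

# Row layout: values of different fields interleaved record by record.
row_layout = [v for r in rows for v in r.values()]

# Column layout: all values of one field stored contiguously.
# Similar values end up next to each other, which compresses far better.
col_layout = {k: [r[k] for r in rows] for k in rows[0]}

print(col_layout["country"])  # ['US', 'US', 'DE'] - a run of equal values
```

A query that only needs `country` can then read a single contiguous column instead of scanning every record, which is the core advantage of a column store.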
Like Google's Dremel, Parquet can encode complex nested data structures, using a repetition level / definition level approach. This approach flattens nested records into columns without losing their structure, so deeply nested schemas can be handled as efficiently as simple, flat ones.
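As an illustration, here is a toy sketch (not Parquet's actual implementation) of Dremel-style shredding for a single column whose type is a repeated integer, i.e. one list per record. The repetition level marks whether a value starts a new record or continues the current list; the definition level marks whether a real value is present or the list is empty:

```python
def shred(records):
    """Shred a column of int lists into values + repetition/definition levels.

    repetition level: 0 = starts a new record, 1 = continues the current list.
    definition level: 1 = a real value is present, 0 = the list is empty.
    """
    values, rep_levels, def_levels = [], [], []
    for rec in records:
        if not rec:  # an empty list still emits one (null) slot
            values.append(None)
            rep_levels.append(0)
            def_levels.append(0)
            continue
        for i, v in enumerate(rec):
            values.append(v)
            rep_levels.append(0 if i == 0 else 1)
            def_levels.append(1)
    return values, rep_levels, def_levels

vals, reps, defs = shred([[1, 2], [], [3]])
print(vals)  # [1, 2, None, 3]
print(reps)  # [0, 1, 0, 0]
print(defs)  # [1, 1, 0, 1]
```

The two level streams are enough to reassemble the original nested records, which is why Parquet can store nested data as flat columns.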
Parquet supports efficient per-column compression and encoding schemes, and more encodings can be added in the future. By separating the concept of encoding from compression, Parquet lets users operate directly on encoded data without first going through a full decompress-and-decode cycle.
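To show what "operating on encoded data" can mean, here is a toy sketch (an assumption for illustration, not Parquet's real code path) of dictionary encoding a column and then evaluating a predicate directly on the encoded ids, without materializing the decoded strings:

```python
def dict_encode(column):
    """Dictionary-encode a column: return (dictionary, list of small int ids)."""
    dictionary, ids, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

dictionary, ids = dict_encode(["US", "US", "DE", "US", "DE"])

# Predicate country == "DE": translate it once into a dictionary id,
# then compare small integers instead of decoding every string value.
target = dictionary.index("DE")
matches = [i for i, code in enumerate(ids) if code == target]

print(dictionary)  # ['US', 'DE']
print(ids)         # [0, 0, 1, 0, 1]
print(matches)     # [2, 4]
```

Because the encoding is defined independently of any compression codec, a reader can work at this encoded level and only decompress pages when it actually needs the raw bytes.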
Parquet is designed for anyone to use. The Hadoop ecosystem contains many data processing frameworks, and an efficient, easy-to-implement columnar storage format should be usable by all of them.
The parquet-mr project provides Java building blocks for working with Parquet data, including Hadoop input/output formats, Pig storage/load functions, and tools for converting other formats to Parquet.
Parquet metadata is encoded using Apache Thrift.
Parquet supports nested data within a columnar storage format.