Parquet File Structure notes

Last Update: 2016-06-06 Source: Internet

Author: User

Parquet is a columnar storage format for analytics workloads. It was developed by Twitter and Cloudera and graduated from the Apache Incubator in May 2015 to become an Apache top-level project. This note summarizes what the Parquet file structure actually looks like.

A Parquet file consists of a header, followed by one or more blocks, and ends with a footer. The header contains only a 4-byte magic number, PAR1, which identifies the file as being in Parquet format. All of the file's metadata is stored in the footer. The footer metadata contains the format version, the schema, any extra key-value pairs, and the metadata for every block in the file. The last two fields in the footer are a 4-byte field that encodes the length of the footer metadata, and the same PAR1 magic number that appears in the header.
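The layout above can be sketched in a few lines of Python. This is a minimal illustration, not a real Parquet reader: the function names `locate_footer` and `build_fake_file` are hypothetical, the "metadata" is a placeholder byte string rather than real encoded metadata, and the sketch assumes the 4-byte length field is stored as a little-endian unsigned integer.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def locate_footer(data: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the footer metadata in a Parquet-like byte string."""
    if len(data) < 12:  # header magic + length field + trailing magic
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes just before the trailing magic hold the metadata length
    # (assumed little-endian here).
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    meta_start = len(data) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("footer length exceeds file size")
    return meta_start, meta_len


def build_fake_file(blocks: bytes, metadata: bytes) -> bytes:
    """Assemble header + blocks + footer with placeholder contents."""
    return MAGIC + blocks + metadata + struct.pack("<I", len(metadata)) + MAGIC


# Usage: 10 bytes of dummy block data and a placeholder metadata blob.
fake = build_fake_file(b"\x00" * 10, b"fake-metadata")
start, length = locate_footer(fake)
print(start, length, fake[start:start + length])
```

Because the length field sits at a fixed position relative to the end of the file, a reader can seek to the end, read 8 bytes, and then jump straight to the metadata without scanning the blocks.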

Note that this differs from sequence files and Avro data files, where the metadata is stored in the header and sync markers are used to separate blocks. Parquet files do not need sync markers, because the block boundaries are stored in the footer metadata.

In a Parquet file, each block stores a row group, which is made up of column chunks holding the column data for those rows. Going one level further down, the data in each column chunk is written in pages. Each page contains values from the same column.
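The nesting of row groups, column chunks, and pages can be modelled with a few small classes. This is only a conceptual sketch: the class names, the `build_row_group` helper, and the fixed `page_size` are assumptions made for illustration and do not correspond to any Parquet library API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Page:
    values: List[Any]  # values from a single column only


@dataclass
class ColumnChunk:
    column: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class RowGroup:
    num_rows: int
    chunks: Dict[str, ColumnChunk] = field(default_factory=dict)


def build_row_group(rows: List[Dict[str, Any]], page_size: int = 2) -> RowGroup:
    """Pivot row-oriented records into one row group of column chunks."""
    columns = list(rows[0].keys()) if rows else []
    group = RowGroup(num_rows=len(rows))
    for name in columns:
        values = [row[name] for row in rows]
        # Split each column's values into fixed-size pages.
        pages = [Page(values[i:i + page_size])
                 for i in range(0, len(values), page_size)]
        group.chunks[name] = ColumnChunk(column=name, pages=pages)
    return group


rows = [
    {"id": 1, "ok": True},
    {"id": 2, "ok": True},
    {"id": 3, "ok": False},
]
group = build_row_group(rows, page_size=2)
for name, chunk in group.chunks.items():
    print(name, [page.values for page in chunk.pages])
```

Keeping all values of one column together is what lets a reader skip the columns a query does not touch.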

Parquet also uses compact encodings. When data is written to a Parquet file, an appropriate encoding is chosen automatically based on the column type; for example, Boolean values are stored using run-length encoding.
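To show why run-length encoding suits Boolean columns, here is a minimal sketch of plain run-length encoding. It is not Parquet's actual on-disk encoding, which combines run-length encoding with bit-packing; the function names `rle_encode` and `rle_decode` are hypothetical.

```python
from itertools import groupby
from typing import Any, List, Tuple


def rle_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs: List[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: List[Any] = []
    for value, count in runs:
        out.extend([value] * count)
    return out


flags = [True, True, True, False, False, True]
runs = rle_encode(flags)
print(runs)
print(rle_decode(runs) == flags)
```

A Boolean column has only two possible values, so long runs of the same value are common and each run collapses to a single pair.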

Reference: "Hadoop: The Definitive Guide, 4th Edition"

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
