Parquet File Structure notes

Source: Internet
Author: User

Parquet is a columnar storage format for analytics workloads, developed by Twitter and Cloudera. It graduated from the Apache Incubator in May 2015 to become an Apache top-level project. This note summarizes how a Parquet file is actually laid out.

A Parquet file consists of a header, one or more blocks, and a footer. The header contains only the 4-byte magic number PAR1, which identifies the file as being in Parquet format. All of the file's metadata is stored in the footer: the format version, the schema, any extra key-value pairs, and the metadata for every block in the file. The last two fields of the footer are a 4-byte field holding the length of the footer metadata, followed by the same magic number PAR1 that appears in the header.
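The layout above can be checked by hand with a few lines of Python. The following is a minimal sketch, not a real Parquet reader: the function name `split_parquet_layout` is hypothetical, it works on bytes already loaded into memory, and it does not decode the footer metadata (which is Thrift-encoded in real files). The length field is read as a little-endian unsigned 32-bit integer.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def split_parquet_layout(data: bytes):
    """Split raw file bytes into (blocks region, footer metadata bytes).

    Hypothetical helper for illustration; it only checks the outer layout.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes just before the trailing magic hold the metadata length.
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    meta_start = len(data) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("footer length exceeds file size")
    return data[4:meta_start], data[meta_start:-8]


# Synthetic stand-in for a file: header, fake blocks, fake metadata, length, magic.
fake_file = MAGIC + b"blocks" + b"meta" + struct.pack("<I", 4) + MAGIC
blocks, metadata = split_parquet_layout(fake_file)
```

On a real file the same arithmetic applies: read the last 8 bytes, take the length, and step back by that many bytes to find the start of the footer metadata.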

Note that this differs from sequence files and Avro data files, which store their metadata in the header and use sync markers to separate blocks. Parquet files do not need sync markers, because the boundaries of each block are stored in the footer metadata.
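To illustrate why recorded boundaries make sync markers unnecessary, here is a small sketch in which a reader jumps straight to each block. The `boundaries` list of `(offset, length)` pairs is an assumption standing in for what a reader would obtain from the footer metadata; `iter_blocks` is a hypothetical name, not part of any Parquet library.

```python
import io
from typing import BinaryIO, Iterator, List, Tuple


def iter_blocks(f: BinaryIO, boundaries: List[Tuple[int, int]]) -> Iterator[bytes]:
    """Yield each block by seeking directly to its recorded (offset, length)."""
    for offset, length in boundaries:
        f.seek(offset)          # no scanning for markers, just a direct seek
        yield f.read(length)


# Synthetic file body: 4-byte header followed by two fake blocks.
fake = io.BytesIO(b"PAR1" + b"aaaa" + b"bbbbbb")
boundaries = [(4, 4), (8, 6)]   # assumed to come from the footer metadata
block_list = list(iter_blocks(fake, boundaries))
```

A format that relies on sync markers has to scan the byte stream to find where the next block starts; here the offsets are known up front.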

Each block in a Parquet file stores a row group, which is made up of column chunks holding the column data for those rows. Going one level down, the data in each column chunk is written in pages, and each page contains values from a single column.
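The nesting of row group, column chunk, and page can be modeled with a few plain data classes. This is only an illustrative sketch: the class names, `build_row_group`, and the `page_size` parameter (counted in values here, purely for simplicity) are assumptions and do not mirror the actual Parquet metadata structures.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Page:
    column: str
    values: List[Any]           # every value comes from the same column


@dataclass
class ColumnChunk:
    column: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class RowGroup:
    num_rows: int
    columns: Dict[str, ColumnChunk] = field(default_factory=dict)


def build_row_group(rows: List[Dict[str, Any]], page_size: int = 2) -> RowGroup:
    """Pivot row-oriented records into one row group of column chunks and pages."""
    group = RowGroup(num_rows=len(rows))
    if not rows:
        return group
    for name in rows[0]:
        values = [row[name] for row in rows]
        chunk = ColumnChunk(column=name)
        # Cut the column's values into fixed-size pages.
        for start in range(0, len(values), page_size):
            chunk.pages.append(Page(column=name, values=values[start:start + page_size]))
        group.columns[name] = chunk
    return group


rows = [{"id": 1, "ok": True}, {"id": 2, "ok": False}, {"id": 3, "ok": True}]
group = build_row_group(rows)
id_pages = [page.values for page in group.columns["id"].pages]
```

Keeping each column's values together in this way is what lets a reader fetch only the columns a query needs.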

Parquet also uses compact encodings. When data is written to a Parquet file, an appropriate encoding is chosen automatically based on the column type; for example, Boolean values are stored using run-length encoding.
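As a rough illustration of run-length encoding on a Boolean column, the sketch below stores each run as a `(value, count)` pair. It is a simplified version of the idea only: Parquet's actual encoding is a hybrid of run-length and bit-packing, and the function names here are hypothetical.

```python
from typing import Iterable, List, Tuple


def rle_encode(values: Iterable[bool]) -> List[Tuple[bool, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Tuple[bool, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs


def rle_decode(runs: Iterable[Tuple[bool, int]]) -> List[bool]:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: List[bool] = []
    for v, n in runs:
        out.extend([v] * n)
    return out


column = [True, True, True, False, False, True]
encoded = rle_encode(column)
decoded = rle_decode(encoded)
```

The longer the runs of identical values in a column, the fewer pairs are needed, which is why this works well for Boolean data.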

Reference: "Hadoop: The Definitive Guide, 4th Edition"
