Parquet and ORC: High-Performance Columnar Storage Formats


Background

With the advent of the big data era, more and more data flows into the Hadoop ecosystem, and the ability to extract valuable insights from terabytes or even petabytes of data has become ever more important for products and companies. During the rapid development of the Hadoop ecosystem, a number of open source data analysis engines have emerged, such as Hive, Spark SQL, Impala, and Presto, along with several high-performance columnar storage formats, such as RCFile, ORC, and Parquet. This article examines ORC and Parquet, two typical columnar storage formats, from the perspective of their implementation, and presents a comparative benchmark between them.

Row storage vs. columnar storage

Because of the characteristics of OLAP queries, columnar storage can significantly improve query performance, but how does it do so? It starts with the principle of columnar storage: as Figure 1 shows, in contrast to the row-oriented storage typically used in relational databases, all the values of each column are stored contiguously. This property enables the following query optimizations:

    • A query does not need to scan all the data; it only reads the columns it references, which can reduce I/O by a factor of N. Per-column statistics (min, max, sum, and so on) can also be kept, enabling partial predicate pushdown (see the query sketch after Figure 1).
    • Because all values in a column have the same type, a more efficient compression algorithm can be chosen for each data type, further reducing I/O.
    • For the same reason, an encoding better suited to the CPU pipeline can be used, reducing CPU cache misses.

Figure 1 Row storage vs. columnar storage
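As a minimal sketch of the column pruning benefit (the table and column names here are hypothetical, not part of the benchmark later in this article), the query below references only three columns; with a columnar layout only those three columns are read from disk, while a row-oriented layout would have to read every column of every row:

SELECT   user_id,
         SUM(amount) AS total_amount
FROM     orders                        -- hypothetical table
WHERE    order_date >= '2016-01-01'    -- only user_id, amount, order_date are scanned
GROUP BY user_id;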

Nested data formats

We typically use relational databases to store structured data, but the data model a relational database supports is flat, so users have to do their own parsing when they encounter types such as lists, maps, and custom structs. In a big data environment, however, data comes from many sources, such as event-tracking data, where parts of objects in a program often need to be written out directly, and each object may itself be nested. If a storage format supports such data in its native form, queries need no extra parsing to obtain the desired results. At Twitter, for example, a typical log object (one record) has 87 fields nested up to 7 levels deep, as illustrated in Figure 2.

Figure 2 Nested data Model

With the growing demand for nested data, query engines in the Hadoop ecosystem now support richer data types; Hive, Spark SQL, Impala, and others natively support complex types such as structs, maps, and arrays. This in turn requires the various storage formats to support nested data as well.

Parquet storage format

Apache Parquet is a columnar storage format for the Hadoop ecosystem. It is compatible with most computing frameworks in that ecosystem (MapReduce, Spark, etc.), is supported by multiple query engines (Hive, Impala, Drill, etc.), and is language and platform independent. Parquet was originally developed and open sourced by Twitter and Cloudera, and it graduated from the Apache Incubator in May 2015 to become a top-level Apache project.

Parquet was originally inspired by the Dremel paper published by Google in 2010, which describes a storage format that supports nested structures and uses columnar storage to improve query performance. The Dremel paper also describes how Google uses this storage format to implement parallel queries; readers interested in that topic can consult the paper and the open source Drill project.

Data model

Parquet supports a nested data model similar to Protocol Buffers. A schema contains multiple fields, and each field has three attributes: repetition, data type, and field name. The repetition can take one of three values: required (occurs exactly once), repeated (occurs zero or more times), or optional (occurs zero or one time). The data type of each field falls into one of two categories: group (complex type) or primitive (basic type). For example, the Document schema given in the Dremel paper is defined as follows:

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}

This schema can be converted into the tree structure shown in Figure 3, where the root node can be thought of as having the repeated type.

Figure 3 Schema structure of parquet

As can be seen, all of the primitive-type fields in the schema are leaf nodes; this schema has six leaf nodes in total, so if it were flattened into a relational model, the table would contain six columns. Parquet has no built-in complex structures such as map or array, but they can be expressed by combining repeated and group. Because a column may occur zero or more times in a record, something must indicate which column values make up a complete record; this is handled by the striping/assembly algorithm.
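To make this mapping concrete, the sketch below shows how a roughly equivalent table could be declared in Hive with complex types and stored as Parquet; the table name and the representation of repeated fields as arrays are illustrative assumptions, not something given in the Dremel paper:

-- Hypothetical Hive approximation of the Document schema above:
-- repeated fields become arrays, groups become structs.
CREATE TABLE document_parquet (
  docid BIGINT,
  links STRUCT<backward:ARRAY<BIGINT>, forward:ARRAY<BIGINT>>,
  name  ARRAY<STRUCT<language:ARRAY<STRUCT<code:STRING,country:STRING>>, url:STRING>>
)
STORED AS PARQUET;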

Because the data model Parquet supports is loosely structured, a single record may contain deep nesting, and maintaining a tree for every record would consume considerable storage space. The Dremel paper therefore proposes an efficient encoding for nested data: the striping/assembly algorithm. The idea is that each member value in a record consists of three parts: the value, the repetition level, and the definition level. The value holds the member's raw data and can be compressed with a type-specific compression algorithm, while the two level values record the value's position within the record. For a repeated column, the repetition level records which record the current value belongs to and where it sits within that record. For repeated and optional columns, a record may contain no value at all for a given column; if such missing values were simply not recorded, a value that belongs to the next record would be mistaken for part of the current record and the data would be corrupted, so a placeholder entry, carried by the definition level, is needed to represent this situation.

With the striping/assembly algorithm, Parquet can represent complex nested structures in very little space; the repetition and definition levels are usually small integers that compress well with the RLE algorithm, further reducing storage. For example, for the Name.Language.Code column of the first sample record in the Dremel paper, the values 'en-us', 'en', NULL, 'en-gb' are stored with repetition levels 0, 2, 1, 1 and definition levels 2, 2, 1, 2.

File structure

Parquet files are stored in binary form, so they cannot be read or modified directly. A Parquet file is self-describing: it contains both the data and the metadata needed to interpret it. The following concepts relate to the HDFS file system and to Parquet files:

    • HDFS block: the smallest replication unit in HDFS. Each block is stored as a local file and replicated on several machines; typical block sizes are 256 MB, 512 MB, and so on.
    • HDFS file: an HDFS file, including its data and metadata, whose data is distributed across multiple blocks.
    • Row group: data is physically partitioned into row-oriented units, each containing a certain number of rows; an HDFS file holds at least one row group. Parquet buffers an entire row group in memory when reading and writing, so the row group size is bounded by the available memory.
    • Column chunk: within a row group, each column is saved in a column chunk, and all column chunks of a row group are stored contiguously in the file. Different column chunks may be compressed with different algorithms.
    • Page: each column chunk is divided into pages; a page is the smallest encoding unit, and different pages within the same column chunk may use different encodings.

Typically, when writing Parquet data, the row group size is set according to the HDFS block size. Because the smallest unit of data a mapper task processes is one block, this lets each row group be handled by a single mapper task, increasing the parallelism of job execution. The layout of a Parquet file is shown in Figure 4.
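Before looking at the file layout in detail, here is a minimal configuration sketch, assuming the commonly used parquet-mr keys (parquet.block.size, parquet.page.size) and the parquet.compression table property; the concrete values and table names are illustrative rather than taken from the tests in this article:

-- Align the Parquet row group with a 128 MB HDFS block and use 1 MB pages.
SET parquet.block.size=134217728;
SET parquet.page.size=1048576;

CREATE TABLE sales_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM sales_text;   -- hypothetical text-format source table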

Figure 4 Parquet File structure

Figure 4 shows the structure of a Parquet file. A file can store multiple row groups. The file begins with its magic code, used to verify that it is a Parquet file, and the footer length field stores the size of the file metadata; from this value and the file length, the offset of the metadata can be computed. The metadata includes the metadata of every row group and the schema of the file. In addition to the per-row-group metadata in the file footer, each page stores its own metadata at its beginning. Parquet defines three page types: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary of a column's values (each column chunk contains at most one dictionary page); and an index page is intended to store the index of a column within the current row group, although index pages are not yet supported in Parquet and will be added in later versions.

Data access

Among the benefits of columnar storage, projection pushdown is undoubtedly the most prominent: when fetching the raw data of a table, only the columns required by the query need to be scanned. Since all values of each column are stored contiguously, scanning the entire table file is avoided.

Parquet natively supports projection pushdown: when a query is executed, the columns to be read can be passed in through the configuration, and they must be a subset of the schema. Parquet scans one row group at a time, reading the column chunks of all required columns in that row group into memory at once; reading a whole row group at a time greatly reduces random reads. In addition, Parquet considers whether the columns are adjacent when reading: if the required columns are stored contiguously, a single read operation can bring several columns' data into memory.

During data access, Parquet can also use the statistics generated for each row group to perform predicate pushdown; these statistics include the maximum, minimum, and null count of each column chunk. Together with the query's filter conditions on a column, they determine whether a row group needs to be scanned at all. In the future, Parquet will also add optimized structures such as Bloom filters and indexes to make predicate pushdown even more effective.
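As an illustration, the hypothetical query below against the TPC-DS store_sales table (used in the tests later) benefits from both mechanisms: only the three referenced columns are read, and row groups whose min/max statistics for ss_sold_date_sk fall entirely outside the range can be skipped (the literal values are arbitrary):

-- Projection pushdown: only ss_item_sk, ss_net_paid, ss_sold_date_sk are read.
-- Predicate pushdown: row groups whose [min, max] of ss_sold_date_sk does not
-- intersect the BETWEEN range are skipped using the row group statistics.
SELECT   ss_item_sk,
         SUM(ss_net_paid) AS total_paid
FROM     store_sales
WHERE    ss_sold_date_sk BETWEEN 2451180 AND 2451545
GROUP BY ss_item_sk;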

ORC file format

The ORC file format is another columnar storage format in the Hadoop ecosystem. It originated in early 2013 and was initially created within Apache Hive to reduce Hadoop data storage and accelerate Hive queries. Like Parquet, it is not a purely columnar format: it first splits the whole table into row groups and stores data column by column within each row group. An ORC file is self-describing, its metadata is serialized with Protocol Buffers, and the data in the file is compressed as much as possible to reduce storage consumption. It is currently supported by query engines such as Spark SQL and Presto, but Impala does not yet support ORC and still uses Parquet as its primary columnar format. In 2015 the ORC project was promoted by the Apache foundation to a top-level Apache project.

Data model

Unlike Parquet, ORC does not natively support a nested data model; instead it supports nested formats by handling complex data types in a special way. Take the following Hive table as an example:

CREATE TABLE `orcStructTable`(
  `name` string,
  `course` struct<course:string,score:int>,
  `score` map<string,int>,
  `work_locations` array<string>
)

The ORC format converts it to the following tree structure:

Figure 5 The schema structure of the ORC

In ORC's structure, this schema contains 10 columns, made up of complex-type columns (list, struct, map, and union) and primitive-type columns (boolean, integer, floating point, string, and so on). The children of a struct node are its member variables, so it may have several children; a map node has two children, key and value; a list node has one child, the element type of the list; union is rarely used in practice. The root of every schema tree is a struct node, and all columns are numbered in the order of a preorder traversal of the tree. In the example above, the root struct is column 0, name is 1, course is 2 (with its members course and score as 3 and 4), score is 5 (key 6, value 7), and work_locations is 8 (its element 9), giving 10 columns in total.

ORC only needs to store the values of the leaf nodes of the schema tree; the intermediate non-leaf nodes act merely as a layer of proxy. They are only responsible for driving the reads of their children: only the real leaf nodes read data, which the parent nodes then assemble into the corresponding data structures and return.

File structure

Like Parquet, ORC files are stored in binary form and therefore cannot be read directly. An ORC file is self-describing and contains a good deal of metadata, all of it serialized with Protocol Buffers. The ORC file structure is shown in Figure 6 and involves the following concepts:

    • ORC file: an ordinary binary file stored on a file system. An ORC file can contain multiple stripes, and each stripe contains multiple records stored independently by column; this corresponds to the row group concept in Parquet.
    • File-level metadata: the file's postscript, the file metadata section (which holds statistics for the whole file), information about all stripes, and the file's schema.
    • Stripe: a group of rows forms a stripe. Files are read a stripe at a time; a stripe is typically the size of an HDFS block and stores the index and the data of each column.
    • Stripe metadata: the location of the stripe, the statistics of each column within the stripe, and the types and locations of all of its streams.
    • Row group: the smallest unit of indexing; one stripe contains multiple row groups, with 10,000 values per row group by default.
    • Stream: a stream represents a meaningful section of data in the file and belongs to either the index or the data category. Index streams hold the position and statistics of each row group, while data streams store the data itself in several forms determined by the column type and encoding.

Figure 6 ORC file structure

Three levels of statistics are kept in an ORC file: file level, stripe level, and row group level. They can all be used to decide, based on the search arguments (the pushed-down predicates), whether certain data can be skipped. The statistics contain the number of values and whether nulls are present, and additional statistics are recorded for specific data types.
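A minimal sketch of how these structures are usually controlled when creating an ORC table in Hive; the values are illustrative, not those used in this article's tests, and the keys (orc.compress, orc.stripe.size, orc.row.index.stride, orc.create.index) are the commonly documented ORC table properties:

-- Hypothetical ORC table with explicit stripe, row index, and compression settings.
CREATE TABLE sales_orc (
  item_sk  BIGINT,
  net_paid DECIMAL(7,2)
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'         = 'ZLIB',      -- default codec
  'orc.stripe.size'      = '67108864',  -- 64 MB stripes
  'orc.row.index.stride' = '10000',     -- one index entry per 10,000 rows (one row group)
  'orc.create.index'     = 'true'       -- write row group indexes
);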

Data access

Reading an ORC file starts at the tail of the file: the first read is 16 KB in size, which ideally pulls both the postscript and the footer into memory at once. The last byte of the file holds the length of the postscript, which is at most 256 bytes. The postscript stores metadata for the whole file, including the compression format, the maximum size of each compression block in the file (i.e. the size of each memory allocation), the footer length, and some version information. The column statistics for the whole file are stored in a separate metadata section next to the footer (not drawn in the figure); they include, for each column of each stripe, the number of values, the minimum, the maximum, whether nulls are present, and so on.

Next, the file footer is read. It contains the length and offset of each stripe, the file's schema (the schema tree is stored in an array indexed by the column numbers described above), the statistics of the whole file, and the number of rows per row group.

When processing a stripe, the location and length of each stripe are first obtained from the file footer, together with each stripe's own footer (its metadata, which records the lengths of the index and data sections); the stripe is divided into an index part and a data part. Inside the stripe, data is chunked by row group (the number of records per row group is stored in the file footer), and each row group is stored column by column. Every column stores its data and index information in several streams, and the data of each stream is compressed with an algorithm chosen according to the column's type. ORC defines the following stream kinds:

    • PRESENT: one bit per member value indicating whether the value is null; with this stream, only the non-null values need to be stored in the other streams.
    • DATA: the member values of the column in the current stripe.
    • LENGTH: the length of each member value; present for columns of string type.
    • DICTIONARY_DATA: the dictionary contents produced when string data is dictionary encoded.
    • SECONDARY: auxiliary values such as the fractional part of decimal values or the nanosecond part of timestamps.
    • ROW_INDEX: the statistics of each row group in the stripe and the start position of each row group.

After all the metadata has been obtained in the initialization phase, the columns to be read can be specified through the include array, a boolean array over the column numbers (if nothing is specified, all columns are read), and a filter can be specified by passing a SearchArgument. Based on the metadata, the index information of each stripe is read first; the row groups that need to be read are then determined from the index statistics and the SearchArgument, and the columns to read within those row groups are determined from the include array. After these two layers of filtering, the data that must be read consists of just a few small ranges within the stripe, and ORC merges discrete ranges wherever possible to reduce the number of I/O operations. Finally, using the row group positions stored in the index, the reader seeks to the first row group of the stripe that needs to be read.
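From Hive, this row-group-level filtering is typically enabled with the hive.optimize.index.filter setting, so that the WHERE clause is turned into a SearchArgument and handed to the ORC reader; the query below is a hypothetical sketch, not one of the benchmark queries:

-- Push the predicate into the ORC reader so row groups whose index
-- statistics cannot satisfy it are skipped entirely.
SET hive.optimize.index.filter=true;

SELECT ss_ticket_number, ss_net_profit
FROM   store_sales_orc             -- hypothetical ORC copy of store_sales
WHERE  ss_store_sk = 42;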

Because ORC uses relatively precise index information, reads can start at any row, and the finer-grained statistics allow the reader to skip entire row groups. Because ORC compresses every block of data and index information with zlib by default, ORC files occupy noticeably less storage space, which is confirmed in the comparison tests below.

Newer versions of ORC also add support for Bloom filters, which can further improve the effectiveness of predicate pushdown; Hive has supported this since version 1.2.0.
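A hedged sketch of how Bloom filters are typically requested when creating an ORC table in Hive 1.2 or later; the chosen column and false positive rate are illustrative assumptions:

-- Build Bloom filters on ss_item_sk so equality lookups on that column can
-- skip row groups whose filter reports the value as definitely absent.
CREATE TABLE store_sales_orc_bf
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'ss_item_sk',
  'orc.bloom.filter.fpp'     = '0.05'
)
AS SELECT * FROM store_sales_text;   -- hypothetical text-format source table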

Performance testing

To compare the two storage formats, I chose the TPC-DS dataset and transformed it to produce flat wide, single-level nested, and multi-level nested tables. Hive, the most commonly used SQL engine, was used for the tests.

Test environment
    • Hadoop cluster: physical test cluster with four DataNode/NodeManager machines, each with 32 cores and 128 GB of RAM; the tests used the entire cluster's resources.
    • Hive: version 1.2.1, started through HiveServer2, with a local MySQL instance as the metastore database; query SQL was submitted over JDBC.
    • Dataset: 100 GB TPC-DS dataset, choosing the fact-table model around store_sales as the test data.
    • Query SQL: ten of the TPC-DS queries that touch the above model, adapted for the tests.
Test Scenarios and Results

The test covers four scenarios. Each scenario compares the storage space consumed by the test data and the time taken by the same queries. Except for scenario one, which uses the original TPC-DS dataset, the data for the other scenarios has to be imported first, so the import times of those scenarios are also compared.
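A minimal sketch of how such an import might look in Hive; the table names are hypothetical, and the statements simply rewrite the text-format data into ORC and Parquet copies whose load times can then be compared:

-- Hypothetical import of the text-format fact table into both columnar formats.
CREATE TABLE store_sales_orc     STORED AS ORC     AS SELECT * FROM store_sales_text;
CREATE TABLE store_sales_parquet STORED AS PARQUET AS SELECT * FROM store_sales_text;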

Scenario One: a fact table, multiple dimension tables, complex join queries.

Based on the original TPC-DS data set.

The store_sales table has 287,997,024 records. Table sizes:

    • Original text format, uncompressed: 38.1 GB
    • ORC format, default compression (zlib), about 1,800 partitions in total: 11.5 GB
    • Parquet format, default compression (Snappy), about 1,800 partitions in total: 14.8 GB

Query test results:

Scenario two: a wide table generated by joining the dimension tables with the fact table; queries run against this single table.

The store_sales, household_demographics, customer_address, date_dim, and store tables of the data model were joined to generate a flat wide table (store_sales_wide_table), and the queries were executed against this table. Because most of the queries selected in scenario one do not apply to this wide table, the SQL from scenario one was partially modified.
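A hedged sketch of how this wide table might be built; the join keys follow the standard TPC-DS foreign keys, while the exact column list and storage settings used in the original test are not stated in the article:

-- Hypothetical construction of the flat wide table (ORC variant shown;
-- the Parquet variant differs only in the STORED AS clause).
CREATE TABLE store_sales_wide_table
STORED AS ORC
AS
SELECT ss.*, d.*, s.*, hd.*, ca.*
FROM   store_sales ss
  JOIN date_dim               d  ON ss.ss_sold_date_sk = d.d_date_sk
  JOIN store                  s  ON ss.ss_store_sk     = s.s_store_sk
  JOIN household_demographics hd ON ss.ss_hdemo_sk     = hd.hd_demo_sk
  JOIN customer_address       ca ON ss.ss_addr_sk      = ca.ca_address_sk;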

The store_sales_wide_table table has 263,704,266 records. Table sizes:

    • Original text format, uncompressed: 149.0 GB
    • ORC format, default compression: 10.6 GB
    • Parquet format, default compression: 12.5 GB

Query test results:

Scenario three: a wide table containing complex data structures (struct, list, map, etc.), nested one level deep.

On the basis of scenario two, each dimension table's columns (everything except the store_sales columns) were converted into a struct or map object, while the fields from the original store_sales table remained unchanged. This produced a new table with one level of nesting (store_sales_wide_table_one_nested), queried with the same logic.
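A minimal sketch of the one-level nesting using Hive's named_struct function; only a couple of columns per dimension and only two dimension tables are shown for brevity, whereas the real test presumably folded in all dimension tables and columns:

-- Hypothetical one-level nesting: each dimension table becomes one struct column.
CREATE TABLE store_sales_wide_table_one_nested
STORED AS ORC
AS
SELECT ss.*,
       named_struct('d_date_sk', d.d_date_sk,
                    'd_year',    d.d_year,
                    'd_moy',     d.d_moy)           AS date_dim,
       named_struct('s_store_sk',   s.s_store_sk,
                    's_store_name', s.s_store_name) AS store_dim
FROM   store_sales ss
  JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
  JOIN store    s ON ss.ss_store_sk     = s.s_store_sk;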

The store_sales_wide_table_one_nested table has 263,704,266 records. Table sizes:

    • Original text format, uncompressed: 245.3 GB
    • ORC format, default compression: 10.9 GB (even smaller than the store_sales table?)
    • Parquet format, default compression: 29.8 GB

Query test results:

Scenario four: complex data structures with multi-level nesting (three levels deep).

On the basis of scenario three, some fields inside the structs of the dimension tables were converted into further struct or map objects; only maps nested inside structs were used, with the deepest nesting being three levels. This produced a new multi-level nested table (store_sales_wide_table_more_nested), queried with the same logic.

This scenario involves only the multi-level nested wide table, with no partition fields. The store_sales_wide_table_more_nested table has 263,704,266 records. Table sizes:

    • Original text format, uncompressed: 222.7 GB
    • ORC format, default compression: 10.9 GB (even smaller than the store_sales table?)
    • Parquet format, default compression: 23.1 GB (smaller than the one-level nested table store_sales_wide_table_one_nested?)

Query test results:

Results analysis

From the test results above, the star schema is not well suited to this kind of data analysis scenario: joining multiple tables greatly slows queries down and fails to exploit the performance advantages of columnar storage, whereas with a wide table the performance of columnar storage improves markedly. The ORC format is far better than the text format in storage space and also has an advantage over Parquet there, and ORC outperforms Parquet when loading data via insert into table ... select. As for query performance, whether on the flat wide table, the one-level nested table, or the multi-level nested table, the two formats differ little from each other, and both give roughly a 2x to 3x improvement over the text format.

In addition, comparing the results of scenarios two and three shows that the flat table structure gives better query performance than the nested structure, so if you choose to use a wide table, it is better to design it as flat as possible and reduce nesting.

Of these three file storage formats, ORC is the best in storage space, data loading speed, and query speed; ORC also supports ACID operations to some extent, and it is currently the columnar format most actively promoted and developed by the Hive community. Since this test targeted the Hive engine, however, it cannot be ruled out that Hive is simply better tuned for ORC than for Parquet.

Summary

This article introduced two columnar storage formats in the Hadoop ecosystem, Parquet and ORC, in detail through their data models, file formats, and data access flows, and compared their storage and query performance with large-volume tests. For data analysis needs in big data scenarios, using either of these storage formats usually brings improvements in both storage and performance, but the actual choice should be made in light of the concrete situation. In addition, since different open source products may have specific optimizations for particular storage formats, the query engine should also be taken into account when choosing.
