Parquet and ORC Performance Test Report


I. Environment Description

Hadoop cluster: the tests use a test Hadoop cluster with the following nodes:

hadoop230, hadoop231, hadoop232, hadoop233

Each machine has the same configuration; the main parameters are:
    • Number of CPUs: 2
    • Number of CPU threads: 32
    • Memory: 128GB
    • Disk: 48TB

All tests run in the same queue on this cluster, and all queries are executed non-concurrently, so each query can use the entire cluster's resources.

Hive is the official Apache Hive 1.2.1 release, started through HiveServer2, with metadata stored in MySQL.

II. Test Data Generation

The test data is the TPC-DS benchmark dataset (official specification: http://www.tpc.org/information/current_specifications.asp). The dataset has 24 tables in total: 7 fact tables and 17 dimension tables; each fact table together with most of the dimension tables forms a snowflake schema. scale_factor is set to 100, which generates 100GB of data.

2.1 Download hive-testbench

git clone https://github.com/hortonworks/hive-testbench

This project generates the TPC-DS dataset and imports it into Hive. Before using it, make sure that commands such as hive and hadoop are on the PATH.

2.2 Compiling

Enter the directory and run ./tpcds-build.sh. This downloads the TPC-DS source code, compiles it, initializes the metastore, and prepares for importing data into Hive.

2.3 Importing Data

Import ORC data: ./tpcds-setup.sh 100 (ORC is the default file format, so no format needs to be specified).
Import Parquet data: FORMAT=parquet ./tpcds-setup.sh 100 (the FORMAT variable selects Parquet).
The parameter 100 generates 100GB of data. The script first generates text-format data in a temporary directory and then converts it into ORC or Parquet.
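
Conceptually, that conversion step is a create-table-as-select from the staged text tables into the target format. A hedged sketch of what the generated DDL does, assuming hive-testbench's tpcds_text_<scale> / tpcds_bin_partitioned_<format>_<scale> database-naming convention:

    -- Hedged sketch of the text-to-columnar conversion (the actual DDL is
    -- generated by hive-testbench; names here follow its assumed conventions).
    CREATE TABLE tpcds_bin_partitioned_orc_100.date_dim
    STORED AS ORC
    AS SELECT * FROM tpcds_text_100.date_dim;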

2.4 Performing Tests

The project provides a set of Hive configuration parameters in the testbench.settings file; set these before starting Hive. The sample-queries-tpcds directory contains 65 test SQL files, the subset of TPC-DS queries that can run on Hive; based on previous tests, 50 of them were selected for this test.
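
testbench.settings contains session-level Hive parameters. The settings below are illustrative of the kind of parameters involved, not a copy of the file:

    -- Illustrative Hive session settings; see testbench.settings for the real list.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;  -- needed by the partitioned fact tables
    SET hive.auto.convert.join=true;                 -- map-join small dimension tables
    SET hive.optimize.ppd=true;                      -- push predicates into the ORC/Parquet readers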

This test mainly compares query performance between the ORC and Parquet file formats, with the uncompressed text format as a baseline. Hive runs as a HiveServer2 background service, and the client executes each query over JDBC, comparing the wall time each query takes (the timestamp at the end of the query minus the timestamp just before it).

2.5 Test Data Model

The data used in the test is the snowflake schema in the TPC-DS dataset centered on the store_sales fact table; the different scenarios construct different table structures from this dataset.
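
The selected queries all share the same general shape: the store_sales fact table joined to a few dimension tables, then filtered and aggregated. A simplified sketch in that spirit (illustrative only, not one of the actual test queries):

    -- Simplified star-join over the store_sales snowflake, for illustration.
    SELECT d.d_year, s.s_store_name, AVG(ss.ss_sales_price) AS avg_price
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN store s ON ss.ss_store_sk = s.s_store_sk
    WHERE d.d_year = 2001
    GROUP BY d.d_year, s.s_store_name;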

III. Test Procedure
    • Input parameters: the list of databases to test, the SQL queries to measure, and the number of rounds N.
    • Repeat for N rounds. At the beginning of each round, create a connection for each database, then execute each SQL query to be measured in rotation: query1 on DB A, query1 on DB B, query1 on DB C, then query2 on DB A, and so on. Because each database's data is stored in a different directory, this avoids interference from caching and other factors.
    • At the end of each round, destroy the connections, then repeat step 2.
    • Time statistics: record the time spent executing each SQL query.
    • Storage formats compared: text format, ORC format, and Parquet format.
    • All queries run non-concurrently, preventing queries from affecting each other.
IV. Scenario Settings

Scenario One: multiple fact tables, multiple dimension tables, complex join queries

The queries run directly on the TPC-DS dataset without modifying any data. The text table is not partitioned (it was generated with 100 reducers), while in the other two storage formats the fact tables are partitioned on data_id. Two tests were performed: 1. execute all 50 queries; 2. select the queries that match the test data model, namely query27, query7, query28, query67, query82, query42, query43, query46, query73, and query96, execute them N times, and compute the average.
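
A sketch of the partitioned fact-table layout (column list abbreviated; the report's data_id presumably corresponds to hive-testbench's ss_sold_date_sk partition column, which would also explain the 1800+ partitions reported below):

    -- Hedged sketch of the partitioned fact table in the ORC/Parquet databases.
    CREATE TABLE store_sales (
      ss_item_sk BIGINT,
      ss_sales_price DECIMAL(7,2)
      -- ... remaining store_sales columns ...
    )
    PARTITIONED BY (ss_sold_date_sk BIGINT)  -- one partition per sales date
    STORED AS ORC;                           -- or STORED AS PARQUET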

Scenario Two: a wide table generated by joining the fact table with dimension tables; queries run against this single table.

Select the store_sales, household_demographics, customer_address, date_dim, and store tables from the data model and join them into a flat wide table (store_sales_wide_table); the queries are executed against this table. Because most of the queries selected in Scenario One cannot be expressed against this wide table, some new query SQL is used; the table model, the load-data SQL, and the queries can be found in the attachment. A sketch of how such a wide table is generated follows.
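
A minimal sketch of generating the flat wide table, assuming the standard TPC-DS join keys (the dimension-column subset shown is illustrative; the full column list is in the attachment):

    -- Hedged sketch: flatten the snowflake into one wide table.
    CREATE TABLE store_sales_wide_table STORED AS ORC AS  -- or PARQUET / TEXTFILE
    SELECT ss.*,
           hd.hd_income_band_sk, hd.hd_buy_potential,
           ca.ca_city, ca.ca_state,
           d.d_year, d.d_moy, d.d_dom,
           s.s_store_name, s.s_city
    FROM store_sales ss
    JOIN household_demographics hd ON ss.ss_hdemo_sk = hd.hd_demo_sk
    JOIN customer_address ca ON ss.ss_addr_sk = ca.ca_address_sk
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN store s ON ss.ss_store_sk = s.s_store_sk;

Inner joins of this kind drop store_sales rows with NULL dimension keys, which would be consistent with the wide table having fewer rows (263,704,266, see below) than store_sales itself (287,997,024).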

Scenario Three: a wide table containing complex data structures such as struct, list, and map (one layer of nesting)

On the basis of Scenario Two, the dimension-table columns (all columns except those from the source store_sales table, which remain unchanged) are converted into struct or map objects, generating a new table with one layer of nesting (store_sales_wide_table_one_nested). It uses the same query logic as Scenario Two, with the SQL modified to fit the new table structure. The table model, the load-data SQL, and the queries can be found in the attachment; a sketch follows.
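
A hedged sketch of the one-layer nested table; which dimension becomes a struct and which becomes a map is an assumption here, and the actual mapping is in the attachment:

    -- Hedged sketch: dimension columns wrapped into one layer of structs/maps.
    CREATE TABLE store_sales_wide_table_one_nested STORED AS ORC AS
    SELECT ss.*,
           NAMED_STRUCT('d_year', d.d_year, 'd_moy', d.d_moy) AS date_dim,
           NAMED_STRUCT('s_store_name', s.s_store_name, 's_city', s.s_city) AS store,
           MAP('ca_city', ca.ca_city, 'ca_state', ca.ca_state) AS customer_address
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN store s ON ss.ss_store_sk = s.s_store_sk
    JOIN customer_address ca ON ss.ss_addr_sk = ca.ca_address_sk;

Queries then address the nested fields with dot and bracket syntax, e.g. WHERE date_dim.d_year = 2001 or customer_address['ca_state'] = 'CA'.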

Scenario Four: complex data structures with multi-layer nesting (three layers)

On the basis of Scenario Three, some fields inside the dimension structs are themselves converted into struct or map objects (maps occur only nested inside structs), so the deepest nesting is three layers. This creates a new multi-layer nested table (store_sales_wide_table_more_nested) that uses the same query logic as Scenario Three, with the SQL modified to fit the new table structure. The table model, the load-data SQL, and the queries can be found in the attachment; a sketch follows.
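
A hedged sketch of the three-layer shape (which specific fields are nested deeper is an assumption; the actual model is in the attachment):

    -- Hedged sketch: struct -> struct -> map gives three nesting layers.
    CREATE TABLE store_sales_wide_table_more_nested STORED AS ORC AS
    SELECT ss.*,
           NAMED_STRUCT(
             'd_year', d.d_year,
             'd_date', NAMED_STRUCT(                         -- layer 2: struct in struct
               'parts', MAP('moy', d.d_moy, 'dom', d.d_dom)  -- layer 3: map in struct
             )
           ) AS date_dim
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk;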

V. Test Results

In each scenario, the comparison focuses on two aspects: 1. the size of the test data, to compare how efficiently the different file formats store it; 2. the query times of the different file formats in the same scenario. In addition, in Scenarios Two, Three, and Four, the speed of loading data into the new tables was ORC > Parquet > text.

Scenario One

This scenario involves all the tables in TPC-DS. Taking the store_sales table as an example, the table has 287,997,024 records, and the file sizes are:

    • Original text format, uncompressed: 38.1 G
    • ORC format, default compression (ZLIB), 1800+ partitions in total: 11.5 G
    • Parquet format, default compression (Snappy?), 1800+ partitions in total: 14.8 G

Test 1 (execute all 50 queries) results:

Description: the difference from the original text format is not large, and text is even somewhat better?! A possible reason is that with multi-table joins each query requires 10+ MapReduce tasks, and the differences between file storage formats mainly appear in the first few jobs that read the original table records; the subsequent tasks do not vary much, which shrinks the overall difference in query time. As the later tests on a single wide table show, the performance gap between text and the two columnar formats is in fact quite large.

Test 2 (execute the 10 selected queries) results:

Description: similar to Test 1, the differences are small, but on query82 the ORC format is much worse?!

Scenario Two

This scenario involves only the wide table, which has no partition columns. The store_sales_wide_table table has 263,704,266 records; its size is:

    • Original text format, uncompressed: 149.0 G
    • ORC format, default compression: 10.6 G (smaller than the store_sales table?)
    • Parquet format, default compression: 12.5 G (smaller than the store_sales table?)

Query test results:

Scenario Three

This scenario involves only the one-layer nested wide table, which has no partition columns. The store_sales_wide_table_one_nested table has 263,704,266 records; its size is:

    • Original text format, uncompressed: 245.3 G
    • ORC format, default compression: 10.9 G (smaller than the store_sales table?)
    • Parquet format, default compression: 29.8 G

Query test results:

Scenario Four

This scenario involves only the multi-layer nested wide table, which has no partition columns. The store_sales_wide_table_more_nested table has 263,704,266 records; its size is:

    • Original text format, uncompressed: 222.7 G
    • ORC format, default compression: 10.9 G (smaller than the store_sales table?)
    • Parquet format, default compression: 23.1 G (smaller than the one-layer nested table store_sales_wide_table_one_nested?)

Query test results:


VI. Summary

This article uses Hive to compare three file storage formats: text, ORC, and Parquet. From the test results, the star schema is not well suited to data analysis scenarios: joining many tables slows queries down considerably and fails to exploit the performance advantages of columnar storage. With wide tables, ORC is far better than text in storage space and also somewhat more compact than Parquet, and when loading data (INSERT INTO table SELECT ...) ORC beats Parquet as well. In final query performance, whether on the flat wide table, the one-layer nested table, or the multi-layer nested wide table, the two columnar formats differ little, and both give roughly a 2x to 3x improvement over the text format. In the query_join2 query, which joins the wide table with another dimension table, ORC outperforms Parquet.

In addition, comparing the test results of Scenario Two and Scenario Three shows that a flat table structure gives better query performance than a nested one. So if you choose a wide-table design, keep the table flat and reduce nesting as much as possible.

Of the three file storage formats, ORC does better in storage space, data loading speed, and query speed. ORC also supports ACID operations to some extent, and it is currently the columnar format more strongly promoted by the Hive community. Note, however, that this test only covers the Hive engine, so it cannot be ruled out that Hive is simply better tuned for ORC than for Parquet.
