Hive ORC and Parquet

Compared with the row-based storage engines of traditional databases, column-based storage engines offer a higher compression ratio and fewer I/O operations. They are especially cost-effective when a table has many columns but each query reads and computes over only a few of them. Among open-source implementations, the best-known columnar storage engines are Parquet and ORC; both are Apache top-level projects and play an important role in the data storage field.

This article focuses on the ORC file storage format; Parquet is not covered in detail here and will be written up separately later.

1. Apache Parquet

Parquet originated from the Google Dremel system and is roughly equivalent to the data storage engine inside Dremel. (Apache Drill, another top-level open-source Apache project, is the open-source implementation of Dremel itself.)

Apache Parquet was initially designed to store nested data, such as Protocol Buffers, Thrift, and JSON records. Storing this kind of data in columnar form enables efficient compression and encoding, and Parquet needs fewer I/O operations to retrieve the required data. This is an advantage Parquet holds over ORC: it can transparently store Protobuf and Thrift data in columnar form, and given how widely Protobuf and Thrift are used today, such easy and natural integration matters. Beyond this, Parquet has few other points in its favor compared with ORC; for example, Parquet supports neither update operations (data cannot be modified after being written) nor ACID.

When creating a table in Hive, use the Parquet data storage format:

CREATE TABLE parquet_table (id INT, name STRING) STORED AS PARQUET;
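Since Parquet's main strength is nested data, here is a minimal sketch of a Parquet table with nested columns; the table and column names are invented for illustration:

CREATE TABLE parquet_nested (
  id INT,
  user_info STRUCT<name: STRING, emails: ARRAY<STRING>>
) STORED AS PARQUET;

-- Nested fields can then be queried directly:
SELECT user_info.name, user_info.emails[0] FROM parquet_nested;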

2. Apache ORC

ORC (Optimized Row Columnar) grew out of the RC (Record Columnar File) storage format. RC is a column-based storage engine with poor support for schema evolution (the data must be regenerated to change the schema); ORC improves on RC mainly in compression encoding and query performance, but its schema-evolution support is still weak. RC/ORC started out inside Hive, has developed strong momentum, and is now a separate project. Hive 1.x supports transactions and update operations on top of ORC (currently not supported on other storage formats). ORC today offers some very advanced features, such as support for update operations, ACID, and the complex types struct and array, which can be used to build nested data structures similar to Parquet's. However, when there are many nesting levels this becomes cumbersome and complex to write; the schema mechanism provided by Parquet expresses multi-level nested data types more easily.

When creating a table in Hive, use the ORC data storage format:

CREATE TABLE orc_table (id INT, name STRING) STORED AS ORC;
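Hive 1.x's transaction support on ORC, mentioned above, requires a bucketed, transactional table and a session configured with a transaction manager. The following is a minimal sketch, not a complete recipe: the table and column names are invented, and depending on the Hive version further settings (for example, compactor configuration in the metastore) may be needed.

-- Session settings required for ACID operations.
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.txn.manager.DbTxnManager;

-- Transactional tables must be bucketed and stored as ORC.
CREATE TABLE orc_acid_table (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Updates and deletes are only allowed on transactional ORC tables.
UPDATE orc_acid_table SET name = 'renamed' WHERE id = 1;
DELETE FROM orc_acid_table WHERE id = 2;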

3. Comparison between Parquet and ORC

Website:              Parquet: http://parquet.apache.org    ORC: http://orc.apache.org
Development status:   Both are top-level open-source Apache projects and columnar storage engines.
Development language: Java (both)
Dominant companies:   Parquet: Twitter/Cloudera             ORC: Hortonworks
Column encoding:      Parquet supports multiple encodings (dictionary, RLE, Delta); ORC supports the mainstream encodings, similar to Parquet.
ACID:                 Parquet: not supported                ORC: ACID transactions supported
Update and delete:    Parquet: not supported                ORC: supported
Indexes (statistics): Parquet: coarse-grained index, with block/group/chunk-level statistics. ORC: coarse-grained index, with file/stripe/row-level statistics; indexes cannot be built precisely down to a specific column.
Query performance:    ORC is higher.
Compression ratio:    ORC is higher.

(Figure: compression ratio comparison.)

4. ORC

Using the ORC file format can improve Hive's performance in reading, writing, and processing data.

An ORC file contains multiple stripes (each stripe is made up of groups of rows) plus a file footer.

A postscript at the end of the file holds the compression parameters and the size of the footer.

A stripe is 250 MB by default; its extended length is limited only by HDFS.

The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.

You can see this information by analyzing an ORC file with hive --orcfiledump:

hive --orcfiledump <path-to-orc-file>

Example:

hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271

As of Hive 1.1, you can also dump the contents of an ORC file by adding the -d flag:

hive --orcfiledump -d <path-to-orc-file>

Example:

hive --orcfiledump -d /user/hive/warehouse/helloworld.db/test_orcfile/part-00271

For more information, refer to the ORC file structure. (Figure: ORC file structure.)


The following command writes the ORC analysis results to a file named orcfile, which you can then read alongside the structure described below:

hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271 > orcfile

From this output we can see that within an ORC file, each stripe consists of index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations, and the row data is what table scans actually read.

The index data includes the minimum and maximum values of each column and the row positions within each column. Row index entries record offsets that locate a row group both in the compressed stream and within the decompressed bytes. Note that ORC indexes are used only to select stripes and row groups, not to answer queries. With reasonably frequent row index entries, large numbers of rows can be skipped for fast reading; by default, up to 10,000 rows can be skipped at a time. Because ORC can skip many rows based on filter predicates, you can sort a table on a secondary key to reduce query execution time. For example, if the primary sort key is the transaction date, the table can also be sorted by province, zip code, or name; a query for records from one province will then skip the records of all other provinces. A sketch of this follows.
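As a hedged illustration of this row skipping (the table, columns, and data below are hypothetical), the sketch writes an ORC table sorted by a secondary key and then filters on it; hive.optimize.index.filter is the Hive setting that lets the reader apply filter predicates against the row-group indexes:

-- Let the ORC reader use row-group indexes for predicate filtering.
SET hive.optimize.index.filter = true;

-- Hypothetical table, sorted by date and then by province.
CREATE TABLE txn_orc (txn_date STRING, province STRING, amount DOUBLE)
STORED AS ORC;

INSERT OVERWRITE TABLE txn_orc
SELECT txn_date, province, amount
FROM txn_raw
SORT BY txn_date, province;

-- Row groups whose min/max range for province excludes the target
-- province are skipped entirely.
SELECT SUM(amount) FROM txn_orc WHERE province = 'Zhejiang';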


The following describes how to use this storage format in Hive:

1) Supported data types

  • Integer
    • Boolean (1 bit)
    • Tinyint (8 bit)
    • Smallint (16 bit)
    • Int (32 bit)
    • Bigint (64 bit)
  • Floating point
    • Float
    • Double
  • String types
    • String
    • Char
    • Varchar
  • Binary blobs
    • Binary
  • Date/time
    • Timestamp
    • Date
  • Compound types (see the sketch after this list)
    • Struct
    • List
    • Map
    • Union
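To illustrate the compound types, here is a minimal sketch (the table and column names are invented; note that ORC's list type corresponds to Hive's ARRAY, and its union to UNIONTYPE):

CREATE TABLE orc_complex (
  id INT,
  location STRUCT<lat: DOUBLE, lon: DOUBLE>,
  tags ARRAY<STRING>,
  attrs MAP<STRING, STRING>
) STORED AS ORC;

-- Nested fields can be addressed directly in queries:
SELECT location.lat, tags[0], attrs['color'] FROM orc_complex;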

2) Hive DDL

Use the ORC storage format by specifying STORED AS ORC:

CREATE TABLE orc_table (
  id INT,
  name STRING
) STORED AS ORC;

You can also change an existing table's storage format:

ALTER TABLE simple_table SET FILEFORMAT ORC;

Note that if data already exists in simple_table, this statement only changes the table metadata; the existing non-ORC data can no longer be read through queries against the table.
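A safer migration path, sketched here with an invented table name, is to rewrite the data into a new ORC table rather than only changing the metadata:

-- Create a new ORC table and rewrite the existing rows into it.
CREATE TABLE simple_table_orc STORED AS ORC AS
SELECT * FROM simple_table;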

3) Specifying ORC storage attributes when creating a table

KEY                       DEFAULT       NOTES
orc.compress              ZLIB          high-level compression codec (NONE, ZLIB, SNAPPY)
orc.compress.size         262,144       number of bytes in each compression chunk
orc.stripe.size           268,435,456   number of bytes in each stripe (memory buffer size for writing)
orc.row.index.stride      10,000        number of rows between index entries
orc.create.index          true          whether to create row indexes
orc.bloom.filter.columns  ""            comma-separated list of column names to build bloom filters for
orc.bloom.filter.fpp      0.05          false positive probability of the bloom filters

For example, create a table without compression:

CREATE TABLE orc_table (
  name STRING,
  age TINYINT
) STORED AS ORC TBLPROPERTIES ("orc.compress" = "NONE");
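Several properties can be combined in the same way. Here is a sketch with illustrative values (choosing id as the bloom filter column is an assumption for the example, not a recommendation from the article):

CREATE TABLE orc_tuned (
  id INT,
  name STRING
) STORED AS ORC TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "id",
  "orc.bloom.filter.fpp" = "0.01"
);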

4) Hive configuration parameters related to ORC storage files

· hive.default.fileformat

The storage file format for tables created by Hive. The default is TextFile.

· hive.exec.orc.default.compress

The default ORC compression codec. The default is ZLIB.

· hive.exec.orc.default.buffer.size

The default ORC buffer size. The default is 262,144 (256 KB).

· hive.exec.orc.default.block.size

The default file system block size for ORC files. The default is 268,435,456 (256 MB).

· hive.exec.orc.zerocopy

Use zero-copy reads for ORC files. Requires Hadoop 2.3 or later.

· hive.orc.compute.splits.num.threads

The number of threads ORC uses to compute splits in parallel.

· hive.exec.orc.skip.corrupt.data

Default false. If the ORC reader encounters corrupt data, this value determines whether to skip the corrupt data or throw an exception. By default an exception is thrown.

· hive.merge.orcfile.stripe.level

When hive.merge.mapfiles, hive.merge.mapredfiles, or hive.merge.tezfiles is enabled and the table data is written in the ORC file format, setting this value to true merges small ORC files quickly at the stripe level.

· Other parameters are rarely used; refer to the Hive official documentation for configuration and tuning guidance.
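As a usage sketch, these parameters can be set per session with SET; the values below are illustrative, not recommendations:

SET hive.default.fileformat = Orc;
SET hive.exec.orc.default.compress = SNAPPY;
SET hive.exec.orc.zerocopy = true;
SET hive.exec.orc.skip.corrupt.data = true;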
