Compared with the row-based storage engines of traditional databases, a column-based storage engine achieves a higher compression ratio and fewer I/O operations. This is especially true when a table has many columns but each query reads and computes only a few of them; in that case the column-based storage engine is far more cost-effective.
Currently, the best-known columnar storage formats among open-source implementations are Parquet and ORC. Both are Apache top-level projects and play an important role as data storage engines.
This article focuses on the ORC file storage format. Parquet is not covered in detail here and will be written up separately later.
1. Apache Parquet
Parquet originated from the Google Dremel system and can be seen as the equivalent of the data storage engine inside Dremel. Apache Drill, another top-level open-source Apache project, is the open-source implementation of Dremel itself.
Apache Parquet was initially designed to store nested data such as Protocol Buffers, Thrift, and JSON. Storing this kind of data in columnar format enables efficient compression and encoding, and lets Parquet retrieve the required data with fewer I/O operations, which is an advantage of Parquet over ORC: it can transparently map Protobuf and Thrift records into columnar storage, and given how widely Protobuf and Thrift are used today, this integration is simple and natural. Beyond these advantages, Parquet has few other points over ORC worth highlighting; for example, Parquet does not support update operations (data cannot be modified after being written) or ACID transactions.
When creating a table in Hive, use the Parquet data storage format:
create table parquet_table (id int, name string) stored as parquet;
2. Apache ORC
The ORC (Optimized Row Columnar) file format evolved from the RC (Record Columnar File) storage format. RC is a column-based storage engine with poor support for schema evolution (modifying the schema requires regenerating the data). ORC improves on RC, mainly in compression encoding and query performance, but it still has poor support for schema evolution. RC/ORC was first used in Hive and later gained enough momentum to become a separate project. Hive 1.x supports transactions and update operations on ORC tables (not currently supported by the other storage formats). ORC has since developed some very advanced features, such as support for update operations, ACID, and complex types like struct and array. Complex types can be used to build nested data structures similar to Parquet's; however, when there are many levels of nesting, writing them out is cumbersome and complicated, whereas the schema representation provided by Parquet expresses multi-level nested data types more easily.
When creating a table in Hive, the ORC data storage format is used:
create table orc_table (id int, name string) stored as orc;
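To illustrate the complex types mentioned above, here is a minimal sketch of an ORC table that nests a struct and an array; the table and column names are invented for illustration:
create table orc_orders (
  order_id  bigint,
  customer  struct<name:string, province:string, zip_code:string>,
  items     array<string>
) stored as orc;
-- nested fields are addressed with dot and bracket syntax
select order_id, customer.province, items[0] from orc_orders;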
3. Comparison between Parquet and ORC
|  | Parquet (http://parquet.apache.org) | ORC (http://orc.apache.org) |
| --- | --- | --- |
| Development status | Apache top-level open-source project; columnar storage engine | Apache top-level open-source project; columnar storage engine |
| Development language | Java | Java |
| Dominant companies | Twitter / Cloudera | Hortonworks |
| Column encoding | Supports multiple encodings: dictionary, RLE, Delta, etc. | Supports mainstream encodings, similar to Parquet |
| ACID | Not supported | Supported (ACID transactions) |
| Update and delete operations | Not supported | Supported |
| Indexes (statistics) | Coarse-grained index; block/group/chunk-level statistics | Coarse-grained index; file/stripe/row-level statistics (indexes cannot be built down to the column level) |
| Query performance | Slightly lower than ORC | Slightly higher than Parquet |
| Compression ratio | Lower than ORC | Higher than Parquet |
[Figure: compression ratio comparison between Parquet and ORC]
4. ORC
Using the ORC file format can improve Hive's performance in reading, writing, and processing data.
An ORC file contains multiple stripes (each stripe is composed of multiple groups of rows) and a file footer.
At the end of the file, a postscript holds the compression parameters and the size of the compressed footer.
A stripe is 250 MB by default, and its extended length is limited only by HDFS.
The file footer contains a list of the stripes in the file, the number of rows in each stripe, and the data type of each column. It also contains column-level aggregates: count, min, max, and sum.
We can see this information by analyzing an ORC file with hive --orcfiledump:
hive --orcfiledump <path_to_file>
Example:
hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271
Starting with Hive 1.1, you can also dump the data content of an ORC file:
hive --orcfiledump -d <path_to_file>
Example:
hive --orcfiledump -d /user/hive/warehouse/helloworld.db/test_orcfile/part-00271
For more information, see the following ORC file structure:
I used the following command to write the ORC analysis output to a file named orcfile, to make it easier to compare against the structure diagram:
hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271 > orcfile
From this we can see that in an ORC file, each stripe contains index data, row data, and a stripe footer.
The stripe footer holds a directory of stream locations, and the row data is what table scans actually read.
The index data includes the minimum and maximum values of each column and the row positions within each column. Row index entries record the offsets of the compressed blocks and of the bytes within the decompressed block. Note that the ORC index is used only to select stripes and row groups, not to answer queries. Having relatively frequent row index entries lets the reader skip rows for fast data access; by default, up to 10,000 rows can be skipped at a time. Because ORC can skip large numbers of rows based on filter predicates, you can sort a table on a secondary key to achieve a large reduction in query execution time. For example, if the primary partition key is the transaction date, the table can be sorted by province, zip code, or name; when querying records for one province, the records of all other provinces are skipped.
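As a hedged sketch of the sorting technique just described (the table and column names are invented, and the actual benefit depends on the Hive/ORC version and settings such as hive.optimize.index.filter):
create table orc_sorted (id int, name string, province string) stored as orc;
-- write the data sorted by province so each stripe/row group covers a narrow range of values
insert overwrite table orc_sorted
select id, name, province from raw_table
sort by province;
-- row-group min/max statistics on province let the reader skip non-matching row groups
select id, name from orc_sorted where province = 'Guangdong';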
The following describes how to use this storage format in Hive:
1) Supported data types
- Integer
  - Boolean (1 bit)
  - Tinyint (8 bit)
  - Smallint (16 bit)
  - Int (32 bit)
  - Bigint (64 bit)
- Floating point
- String types
- Binary blobs
- Date/time
- Compound types
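The list above maps directly onto Hive column types; a minimal sketch of a table that uses several of them (the table and column names are illustrative):
create table orc_types_demo (
  flag        boolean,
  age         tinyint,
  total       bigint,
  ratio       double,
  label       string,
  payload     binary,
  created_at  timestamp
) stored as orc;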
2) Hive DDL
Use the ORC storage format by specifying stored as orc:
create table orc_table (
  id int,
  name string
) stored as orc;
You can also change the storage format of an existing table:
alter table simple_table set fileformat orc;
Note that if simple_table already contains data, that data can no longer be read correctly through table queries after the change, because only the table metadata is updated and the existing files remain in the old format.
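Because of this, a common workaround (sketched here with an illustrative table name) is to rewrite the data into a new ORC table instead of altering the format in place:
-- create an ORC copy of the table and rewrite the existing rows into the new format
create table simple_table_orc stored as orc
as select * from simple_table;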
3) Specifying ORC storage format properties when creating a table
| KEY | DEFAULT | NOTES |
| --- | --- | --- |
| orc.compress | ZLIB | High-level compression codec: NONE, ZLIB, or SNAPPY |
| orc.compress.size | 262,144 | Number of bytes in each compression chunk |
| orc.stripe.size | 268,435,456 | Number of bytes in the memory buffer used for writing a stripe |
| orc.row.index.stride | 10,000 | Number of rows between index entries |
| orc.create.index | true | Whether to create row indexes |
| orc.bloom.filter.columns | "" | Comma-separated list of column names for which bloom filters should be created |
| orc.bloom.filter.fpp | 0.05 | False positive probability for bloom filters |
For example, create a table without compression:
create table orc_table (
  name string,
  age tinyint
) stored as orc tblproperties ("orc.compress" = "NONE");
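Several of these properties can be combined in one statement; the following is only an illustrative sketch (the values are examples, not tuning recommendations):
create table orc_table_snappy (
  id   int,
  name string
) stored as orc tblproperties (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "name",
  "orc.bloom.filter.fpp" = "0.05"
);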
4) Hive configuration parameters related to ORC storage files
· hive.default.fileformat
The storage file format for tables created by Hive. The default is TextFile.
· hive.exec.orc.default.compress
The default compression codec for ORC files. The default is ZLIB.
· hive.exec.orc.default.buffer.size
The default buffer size for ORC files. The default is 262,144 (256 KB).
· hive.exec.orc.default.block.size
The default system block size for ORC files. The default is 268,435,456 (256 MB).
· hive.exec.orc.zerocopy
Whether to use zero-copy reads for ORC files. Requires Hadoop 2.3 or later.
· hive.orc.compute.splits.num.threads
The number of threads ORC uses to compute splits in parallel.
· hive.exec.orc.skip.corrupt.data (default: false)
If the ORC reader encounters corrupt data, this value determines whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception.
· hive.merge.orcfile.stripe.level
When hive.merge.mapfiles, hive.merge.mapredfiles, or hive.merge.tezfiles is set to true and the table data is written in the ORC file format, setting this value to true enables fast, stripe-level merging of small ORC files.
· Other parameters are rarely used; refer to the Hive official documentation for configuration and tuning details.
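For illustration only (the values below are examples, not recommendations), a Hive session might set a few of these parameters before writing ORC data:
set hive.default.fileformat=ORC;
set hive.exec.orc.default.compress=SNAPPY;
set hive.exec.orc.zerocopy=true;
set hive.merge.orcfile.stripe.level=true;
-- with hive.default.fileformat=ORC, new tables default to the ORC storage format
create table orc_by_default (id int, name string);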