Hive ORC and Parquet

Source: Internet
Author: User

A columnar storage engine offers a higher compression ratio and fewer I/O operations than a traditional row-oriented storage engine. It is especially cost-effective when a table has many columns but each query reads and computes over only a few of them.

Among open source implementations, the best-known columnar storage engines are Parquet and ORC. Both are Apache top-level projects and play an important role in data storage.

This article focuses on the ORC file storage format; Parquet is not explained in depth here and may be covered separately later.

1. Apache Parquet

Parquet originates from the Google Dremel system: it is roughly equivalent to Dremel's data storage layer, while the Apache top-level project Drill is the open source implementation of Dremel as a whole.

Apache Parquet was originally designed to store nested data, such as Protocol Buffers, Thrift, and JSON records, in a columnar format, enabling efficient compression and encoding and allowing the required data to be extracted with fewer I/O operations. This is Parquet's advantage over ORC: it can transparently store Protobuf and Thrift data in columnar form, and with Protobuf and Thrift so widely used today, that integration is no small convenience. Beyond this, Parquet has few advantages over ORC; for example, it does not support update operations (data cannot be modified once written) and does not support ACID transactions.

To use the Parquet storage format when creating a table in Hive:

CREATE TABLE parquet_table (id INT, name STRING) STORED AS PARQUET;
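
Parquet's strength with nested data can be illustrated from Hive DDL as well; the following is a minimal sketch (table and column names are invented for illustration, and complex-type support depends on the Hive version) of a Parquet table with nested types:

-- Hypothetical example: nested types stored in Parquet's columnar layout.
CREATE TABLE parquet_nested (
  id      INT,
  contact STRUCT<email:STRING, phone:STRING>,  -- nested record
  tags    ARRAY<STRING>                        -- repeated field
) STORED AS PARQUET;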

2. Apache ORC

The ORC (Optimized Row Columnar) file format derives from the RC (Record Columnar File) format, a columnar storage format with poor support for schema evolution: modifying the schema requires regenerating the data. ORC improves on RC mainly in compression encoding and query performance optimization, but its schema-evolution support remains limited. RC/ORC was initially used inside Hive, gained momentum, and was eventually spun off as an independent project. Hive 1.x's support for transactions and update operations is built on ORC (other storage formats are not supported for this yet). ORC has since evolved some very advanced features, such as support for update operations, ACID support, and support for complex types such as struct and array. You can use complex types to build nested schemas similar to Parquet's, but when nesting is deep this becomes cumbersome and verbose to write, whereas Parquet's schema representation makes deeply nested data types easier to express.

To use the ORC storage format when creating a table in Hive:

CREATE TABLE orc_table (id INT, name STRING) STORED AS ORC;
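
ORC supports the same kind of nesting through Hive's complex types; a minimal sketch (names invented for illustration) of an ORC table with struct, array, and map columns:

-- Hypothetical example: an ORC table using complex types.
CREATE TABLE orc_nested (
  id      INT,
  address STRUCT<city:STRING, zip:STRING>,  -- nested record
  scores  ARRAY<INT>,                       -- repeated values
  attrs   MAP<STRING, STRING>               -- key/value pairs
) STORED AS ORC;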

3. Comparison between Parquet and ORC

                     | Parquet                                                   | ORC
---------------------|-----------------------------------------------------------|----------------------------------------------------------
Website              | http://parquet.apache.org                                 | http://orc.apache.org
Development status   | Apache top-level project, columnar storage engine         | Apache top-level project, columnar storage engine
Development language | Java                                                      | Java
Leading companies    | Twitter / Cloudera                                        | Hortonworks
Column encoding      | Multiple encodings: dictionary, RLE, delta, etc.          | Mainstream encodings, similar to Parquet
ACID                 | Not supported                                             | ACID transactions supported
Update/delete        | Not supported                                             | Supported
Indexes (statistics) | Coarse-grained index; block/group/chunk-level statistics  | Coarse-grained index; file/stripe/row-level statistics, not precise per-column indexes
Query performance    | Slightly lower                                            | Slightly higher
Compression ratio    | Lower                                                     | Higher

(The original article included a chart comparing the compression ratios of the two formats here.)

4. ORC

Using the ORC file format can improve Hive's performance when reading, writing, and processing data.

An ORC file contains multiple stripes (each stripe holds a group of rows) plus a file footer containing auxiliary information.

At the end of the file, a postscript holds the compression parameters and the length of the compressed footer.

The default stripe size is 250 MB; large stripes enable large, efficient reads from HDFS.

The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.

We can view this information by using hive --orcfiledump to parse an ORC file:

hive --orcfiledump <path_to_file>

Example:

hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271

Starting with Hive 1.1, the data content of an ORC file can also be dumped with the -d option:

hive --orcfiledump -d <path_to_file>

Example:

hive --orcfiledump -d /user/hive/warehouse/helloworld.db/test_orcfile/part-00271

(The original article included a diagram of the ORC file structure here, illustrating the layout described below.)


I use the following command to write the dump output to a file named orcfile, which makes it easier to compare against the structure diagram:

hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271 > orcfile

From this we can see that each stripe in an ORC file consists of index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations; row data is what table scans read.

Index data includes the minimum and maximum values for each column, along with the row positions within each column. Row index entries record the compressed block and the byte offset needed to seek to the start of a row group. Note that ORC indexes are used only to select stripes and row groups, not to answer queries directly. With relatively fine-grained row index entries, whole groups of rows can be skipped during a fast read; by default an index entry is written every 10,000 rows. Because ORC can skip large numbers of rows based on filter predicates, you can sort a table on a secondary key to reduce query execution time. For example, if the primary sort key is the transaction date, the table can additionally be sorted by province, zip code, or name; then, when querying records for one province, records from all the other provinces are skipped. A sketch of this technique follows.
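
As a sketch of this technique (table names, column names, and the province value are assumptions for illustration), data can be written sorted on the secondary key so that stripe and row-group min/max statistics become selective:

-- Hypothetical example: cluster rows by the filter column at write time.
CREATE TABLE trades_orc (
  trade_date STRING,
  province   STRING,
  amount     DOUBLE
) STORED AS ORC;

INSERT OVERWRITE TABLE trades_orc
SELECT trade_date, province, amount
FROM trades_staging
SORT BY province;  -- rows with the same province end up adjacent

-- With predicate pushdown enabled, non-matching row groups can be skipped:
SET hive.optimize.index.filter=true;
SELECT * FROM trades_orc WHERE province = 'Zhejiang';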


The following describes how this storage format is used in Hive:

1) Supported data types

    • Integer
      • boolean (1 bit)
      • tinyint (8 bit)
      • smallint (16 bit)
      • int (32 bit)
      • bigint (64 bit)
    • Floating point
      • float
      • double
    • String types
      • string
      • char
      • varchar
    • Binary blobs
      • binary
    • Date/time
      • timestamp
      • date
    • Compound types
      • struct
      • list
      • map
      • union

2) Hive DDL

Use the ORC storage format by specifying STORED AS ORC:

CREATE TABLE orc_table (
  id INT,
  name STRING
) STORED AS ORC;

You can also change the storage format of an existing table:

ALTER TABLE simple_table SET FILEFORMAT ORC;

Note that if simple_table already contains data, changing the file format does not rewrite it, so the existing data becomes unreadable through table queries.
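
Since ALTER TABLE ... SET FILEFORMAT only changes the declared format without rewriting files, a safer migration is to create a new ORC table and copy the data; a minimal sketch with assumed table names:

-- Hypothetical migration: rewrite the data into ORC rather than
-- only changing the table's declared file format.
CREATE TABLE simple_table_orc STORED AS ORC
AS SELECT * FROM simple_table;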

3) Specifying ORC storage format properties when creating a table

Key                      | Default     | Notes
-------------------------|-------------|--------------------------------------------------------
orc.compress             | ZLIB        | Compression codec: one of NONE, ZLIB, SNAPPY
orc.compress.size        | 262,144     | Number of bytes in each compression chunk
orc.stripe.size          | 268,435,456 | Number of bytes in each stripe (write memory buffer size)
orc.row.index.stride     | 10,000      | Number of rows between index entries
orc.create.index         | true        | Whether to create row indexes
orc.bloom.filter.columns | ""          | Comma-separated list of columns to build bloom filters for
orc.bloom.filter.fpp     | 0.05        | Bloom filter false positive probability

For example, create a table without compression:

CREATE TABLE orc_table (
  name STRING,
  age TINYINT
) STORED AS ORC TBLPROPERTIES ("orc.compress" = "NONE");
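
Several properties can be combined in one TBLPROPERTIES clause; for instance, a sketch (the column choice is illustrative) enabling SNAPPY compression and a bloom filter on a frequently filtered column:

CREATE TABLE orc_table_tuned (
  id   INT,
  name STRING
) STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "name",
  "orc.bloom.filter.fpp" = "0.05"
);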

4) Hive configuration parameters related to ORC files

· hive.default.fileformat

The default storage format for tables created by Hive; defaults to TextFile.

· hive.exec.orc.default.compress

The default compression codec for ORC; defaults to ZLIB.

· hive.exec.orc.default.buffer.size

The default ORC buffer size; defaults to 262,144 bytes (256 KB).

· hive.exec.orc.default.block.size

The default file system block size for ORC files; defaults to 268,435,456 bytes (256 MB).

· hive.exec.orc.zerocopy

Use zero-copy when reading ORC files; supported on Hadoop 2.3 and later releases.

· hive.orc.compute.splits.num.threads

How many threads ORC uses to create input splits in parallel.

· hive.exec.orc.skip.corrupt.data

If the ORC reader encounters corrupt data, this value determines whether the corrupt data is skipped or an exception is thrown. The default is to throw an exception.

· hive.merge.orcfile.stripe.level

When hive.merge.mapfiles, hive.merge.mapredfiles, or hive.merge.tezfiles is enabled and the table data is written in the ORC format, setting this to true merges small ORC files quickly at the stripe level.

· Other parameters are rarely used; refer to the Hive documentation for configuration and tuning guidance.
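
These parameters can also be set per session rather than in hive-site.xml; a sketch using the defaults discussed above (except zero-copy and stripe-level merging, which must be enabled explicitly):

-- Session-level settings; adjust to your environment.
SET hive.default.fileformat=Orc;
SET hive.exec.orc.default.compress=ZLIB;
SET hive.exec.orc.zerocopy=true;            -- requires Hadoop 2.3+
SET hive.merge.orcfile.stripe.level=true;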
