Hive ORC and Parquet

Source: Internet
Author: User

A columnar storage engine offers a higher compression ratio and fewer I/O operations than a traditional row-oriented storage engine. It is especially cost-effective when a table has many columns but each query reads and computes over only a few of them.

Among open source implementations, the best-known columnar storage engines are Parquet and ORC. Both are Apache top-level projects and play an important role in data storage.

This article focuses on the ORC file storage format; Parquet is not explained in depth here and may be covered separately later.

1. Apache Parquet

Parquet originates from the Google Dremel system: it is roughly equivalent to Dremel's data storage layer, while the Apache top-level project Drill is the open source implementation of Dremel as a whole.

Apache Parquet was originally designed to store nested data, such as Protocol Buffers, Thrift, and JSON records, in a columnar format, enabling efficient compression and encoding and allowing the required data to be extracted with fewer I/O operations. This is Parquet's advantage over ORC: it can transparently store Protobuf and Thrift data in columnar form, and with Protobuf and Thrift so widely used today, that integration is no small convenience. Beyond this, Parquet has few advantages over ORC; for example, it does not support update operations (data cannot be modified once written) and does not support ACID transactions.

To use the Parquet storage format when creating a table in Hive:

CREATE TABLE parquet_table (id INT, name STRING) STORED AS PARQUET;
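
Parquet's strength with nested data can be illustrated from Hive DDL as well; the following is a minimal sketch (table and column names are invented for illustration, and complex-type support depends on the Hive version) of a Parquet table with nested types:

-- Hypothetical example: nested types stored in Parquet's columnar layout.
CREATE TABLE parquet_nested (
  id      INT,
  contact STRUCT<email:STRING, phone:STRING>,  -- nested record
  tags    ARRAY<STRING>                        -- repeated field
) STORED AS PARQUET;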

2. Apache ORC

The ORC (Optimized Row Columnar) file format derives from the RC (Record Columnar File) format, a columnar storage format with poor support for schema evolution: modifying the schema requires regenerating the data. ORC improves on RC mainly in compression encoding and query performance optimization, but its schema-evolution support remains limited. RC/ORC was initially used inside Hive, gained momentum, and was eventually spun off as an independent project. Hive 1.x's support for transactions and update operations is built on ORC (other storage formats are not supported for this yet). ORC has since evolved some very advanced features, such as support for update operations, ACID support, and support for complex types such as struct and array. You can use complex types to build nested schemas similar to Parquet's, but when nesting is deep this becomes cumbersome and verbose to write, whereas Parquet's schema representation makes deeply nested data types easier to express.

To use the ORC storage format when creating a table in Hive:

CREATE TABLE orc_table (id INT, name STRING) STORED AS ORC;
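
ORC supports the same kind of nesting through Hive's complex types; a minimal sketch (names invented for illustration) of an ORC table with struct, array, and map columns:

-- Hypothetical example: an ORC table using complex types.
CREATE TABLE orc_nested (
  id      INT,
  address STRUCT<city:STRING, zip:STRING>,  -- nested record
  scores  ARRAY<INT>,                       -- repeated values
  attrs   MAP<STRING, STRING>               -- key/value pairs
) STORED AS ORC;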

3. Comparison between Parquet and ORC

                     | Parquet                                                   | ORC
---------------------|-----------------------------------------------------------|----------------------------------------------------------
Website              | http://parquet.apache.org                                 | http://orc.apache.org
Development status   | Apache top-level project, columnar storage engine         | Apache top-level project, columnar storage engine
Development language | Java                                                      | Java
Leading companies    | Twitter / Cloudera                                        | Hortonworks
Column encoding      | Multiple encodings: dictionary, RLE, delta, etc.          | Mainstream encodings, similar to Parquet
ACID                 | Not supported                                             | ACID transactions supported
Update/delete        | Not supported                                             | Supported
Indexes (statistics) | Coarse-grained index; block/group/chunk-level statistics  | Coarse-grained index; file/stripe/row-level statistics, not precise per-column indexes
Query performance    | Slightly lower                                            | Slightly higher
Compression ratio    | Lower                                                     | Higher

(The original article included a chart comparing the compression ratios of the two formats here.)

4. ORC

Using the ORC file format can improve Hive's performance when reading, writing, and processing data.

An ORC file contains multiple stripes (each stripe holds a group of rows) plus a file footer containing auxiliary information.

At the end of the file, a postscript holds the compression parameters and the length of the compressed footer.

The default stripe size is 250 MB; large stripes enable large, efficient reads from HDFS.

The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.

We can view this information by using hive --orcfiledump to parse an ORC file:

hive --orcfiledump <path_to_file>

Example:

hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271

Starting with Hive 1.1, the data content of an ORC file can also be dumped with the -d option:

hive --orcfiledump -d <path_to_file>

Example:

hive --orcfiledump -d /user/hive/warehouse/helloworld.db/test_orcfile/part-00271

(The original article included a diagram of the ORC file structure here, illustrating the layout described below.)


I use the following command to write the dump output to a file named orcfile, which makes it easier to compare against the structure diagram:

hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271 > orcfile

From this we can see that each stripe in an ORC file consists of index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations; row data is what table scans read.

Index data includes the minimum and maximum values for each column, along with the row positions within each column. Row index entries record the compressed block and the byte offset needed to seek to the start of a row group. Note that ORC indexes are used only to select stripes and row groups, not to answer queries directly. With relatively fine-grained row index entries, whole groups of rows can be skipped during a fast read; by default an index entry is written every 10,000 rows. Because ORC can skip large numbers of rows based on filter predicates, you can sort a table on a secondary key to reduce query execution time. For example, if the primary sort key is the transaction date, the table can additionally be sorted by province, zip code, or name; then, when querying records for one province, records from all the other provinces are skipped. A sketch of this technique follows.
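
As a sketch of this technique (table names, column names, and the province value are assumptions for illustration), data can be written sorted on the secondary key so that stripe and row-group min/max statistics become selective:

-- Hypothetical example: cluster rows by the filter column at write time.
CREATE TABLE trades_orc (
  trade_date STRING,
  province   STRING,
  amount     DOUBLE
) STORED AS ORC;

INSERT OVERWRITE TABLE trades_orc
SELECT trade_date, province, amount
FROM trades_staging
SORT BY province;  -- rows with the same province end up adjacent

-- With predicate pushdown enabled, non-matching row groups can be skipped:
SET hive.optimize.index.filter=true;
SELECT * FROM trades_orc WHERE province = 'Zhejiang';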


The following describes how this storage format is used in Hive:

1) Supported data types

    • Integer
      • boolean (1 bit)
      • tinyint (8 bit)
      • smallint (16 bit)
      • int (32 bit)
      • bigint (64 bit)
    • Floating point
      • float
      • double
    • String types
      • string
      • char
      • varchar
    • Binary blobs
      • binary
    • Date/time
      • timestamp
      • date
    • Compound types
      • struct
      • list
      • map
      • union

2) Hive DDL

Use the ORC storage format by specifying STORED AS ORC:

CREATE TABLE orc_table (
  id INT,
  name STRING
) STORED AS ORC;

You can also change the storage format of an existing table:

ALTER TABLE simple_table SET FILEFORMAT ORC;

Note that if simple_table already contains data, changing the file format does not rewrite it, so the existing data becomes unreadable through table queries.
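
Since ALTER TABLE ... SET FILEFORMAT only changes the declared format without rewriting files, a safer migration is to create a new ORC table and copy the data; a minimal sketch with assumed table names:

-- Hypothetical migration: rewrite the data into ORC rather than
-- only changing the table's declared file format.
CREATE TABLE simple_table_orc STORED AS ORC
AS SELECT * FROM simple_table;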

3) Specifying ORC storage format properties when creating a table

Key                      | Default     | Notes
-------------------------|-------------|--------------------------------------------------------
orc.compress             | ZLIB        | Compression codec: one of NONE, ZLIB, SNAPPY
orc.compress.size        | 262,144     | Number of bytes in each compression chunk
orc.stripe.size          | 268,435,456 | Number of bytes in each stripe (write memory buffer size)
orc.row.index.stride     | 10,000      | Number of rows between index entries
orc.create.index         | true        | Whether to create row indexes
orc.bloom.filter.columns | ""          | Comma-separated list of columns to build bloom filters for
orc.bloom.filter.fpp     | 0.05        | Bloom filter false positive probability

For example, create a table without compression:

CREATE TABLE orc_table (
  name STRING,
  age TINYINT
) STORED AS ORC TBLPROPERTIES ("orc.compress" = "NONE");
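
Several properties can be combined in one TBLPROPERTIES clause; for instance, a sketch (the column choice is illustrative) enabling SNAPPY compression and a bloom filter on a frequently filtered column:

CREATE TABLE orc_table_tuned (
  id   INT,
  name STRING
) STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "name",
  "orc.bloom.filter.fpp" = "0.05"
);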

4) Hive configuration parameters related to ORC files

· hive.default.fileformat

The default storage format for tables created by Hive; defaults to TextFile.

· hive.exec.orc.default.compress

The default compression codec for ORC; defaults to ZLIB.

· hive.exec.orc.default.buffer.size

The default ORC buffer size; defaults to 262,144 bytes (256 KB).

· hive.exec.orc.default.block.size

The default file system block size for ORC files; defaults to 268,435,456 bytes (256 MB).

· hive.exec.orc.zerocopy

Use zero-copy when reading ORC files; supported on Hadoop 2.3 and later releases.

· hive.orc.compute.splits.num.threads

How many threads ORC uses to create input splits in parallel.

· hive.exec.orc.skip.corrupt.data

If the ORC reader encounters corrupt data, this value determines whether the corrupt data is skipped or an exception is thrown. The default is to throw an exception.

· hive.merge.orcfile.stripe.level

When hive.merge.mapfiles, hive.merge.mapredfiles, or hive.merge.tezfiles is enabled and the table data is written in the ORC format, setting this to true merges small ORC files quickly at the stripe level.

· Other parameters are rarely used; refer to the Hive documentation for configuration and tuning guidance.
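
These parameters can also be set per session rather than in hive-site.xml; a sketch using the defaults discussed above (except zero-copy and stripe-level merging, which must be enabled explicitly):

-- Session-level settings; adjust to your environment.
SET hive.default.fileformat=Orc;
SET hive.exec.orc.default.compress=ZLIB;
SET hive.exec.orc.zerocopy=true;            -- requires Hadoop 2.3+
SET hive.merge.orcfile.stripe.level=true;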
