Compared with the row-based storage engines of traditional databases, a column-based storage engine achieves a higher compression ratio and fewer I/O operations. This is especially true when a table has many columns but each query reads and computes only a few of them; in that case the column-based storage engine is far more cost-effective.
Currently, the best-known columnar storage formats among open-source implementations are Parquet and ORC. Both are Apache top-level projects and play an important role as data storage engines.
This article focuses on the ORC file storage format. Parquet is not covered in detail here and will be written up separately later.
1. Apache Parquet
Parquet originated from the Google Dremel system and can be seen as the equivalent of the data storage engine inside Dremel. Apache Drill, another top-level open-source Apache project, is the open-source implementation of Dremel itself.
Apache Parquet was initially designed to store nested data such as Protocol Buffers, Thrift, and JSON. Storing this kind of data in columnar format enables efficient compression and encoding, and lets Parquet retrieve the required data with fewer I/O operations, which is an advantage of Parquet over ORC: it can transparently map Protobuf and Thrift records into columnar storage, and given how widely Protobuf and Thrift are used today, this integration is simple and natural. Beyond these advantages, Parquet has few other points over ORC worth highlighting; for example, Parquet does not support update operations (data cannot be modified after being written) or ACID transactions.
When creating a table in Hive, use the Parquet data storage format:
create table parquet_table (id int, name string) stored as parquet;
2. Apache ORC
The ORC (Optimized Row Columnar) file format evolved from the RC (Record Columnar File) storage format. RC is a column-based storage engine with poor support for schema evolution (modifying the schema requires regenerating the data). ORC improves on RC, mainly in compression encoding and query performance, but it still has poor support for schema evolution. RC/ORC was first used in Hive and later gained enough momentum to become a separate project. Hive 1.x supports transactions and update operations on ORC tables (not currently supported by the other storage formats). ORC has since developed some very advanced features, such as support for update operations, ACID, and complex types like struct and array. Complex types can be used to build nested data structures similar to Parquet's; however, when there are many levels of nesting, writing them out is cumbersome and complicated, whereas the schema representation provided by Parquet expresses multi-level nested data types more easily.
When creating a table in Hive, the ORC data storage format is used:
create table orc_table (id int, name string) stored as orc;
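To illustrate the complex types mentioned above, here is a minimal sketch of an ORC table that nests a struct and an array; the table and column names are invented for illustration:
create table orc_orders (
  order_id  bigint,
  customer  struct<name:string, province:string, zip_code:string>,
  items     array<string>
) stored as orc;
-- nested fields are addressed with dot and bracket syntax
select order_id, customer.province, items[0] from orc_orders;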
3. Comparison between Parquet and ORC
|  | Parquet (http://parquet.apache.org) | ORC (http://orc.apache.org) |
| --- | --- | --- |
| Development status | Apache top-level open-source project; columnar storage engine | Apache top-level open-source project; columnar storage engine |
| Development language | Java | Java |
| Dominant companies | Twitter / Cloudera | Hortonworks |
| Column encoding | Supports multiple encodings: dictionary, RLE, Delta, etc. | Supports mainstream encodings, similar to Parquet |
| ACID | Not supported | Supported (ACID transactions) |
| Update and delete operations | Not supported | Supported |
| Indexes (statistics) | Coarse-grained index; block/group/chunk-level statistics | Coarse-grained index; file/stripe/row-level statistics (indexes cannot be built down to the column level) |
| Query performance | Slightly lower than ORC | Slightly higher than Parquet |
| Compression ratio | Lower than ORC | Higher than Parquet |
[Figure: compression ratio comparison between Parquet and ORC]
4. ORC
Using the ORC file format can improve Hive's performance in reading, writing, and processing data.
An ORC file contains multiple stripes (each stripe is composed of multiple groups of rows) and a file footer.
At the end of the file, a postscript holds the compression parameters and the size of the compressed footer.
A stripe is 250 MB by default, and its extended length is limited only by HDFS.
The file footer contains a list of the stripes in the file, the number of rows in each stripe, and the data type of each column. It also contains column-level aggregates: count, min, max, and sum.
We can see this information by analyzing an ORC file with hive --orcfiledump:
hive --orcfiledump <path_to_file>
Example:
hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271
Starting with Hive 1.1, you can also dump the data content of an ORC file:
hive --orcfiledump -d <path_to_file>
Example:
hive --orcfiledump -d /user/hive/warehouse/helloworld.db/test_orcfile/part-00271
For more information, see the following ORC file structure:
I used the following command to write the ORC analysis output to a file named orcfile, to make it easier to compare against the structure diagram:
hive --orcfiledump /user/hive/warehouse/helloworld.db/test_orcfile/part-00271 > orcfile
From this we can see that in an ORC file, each stripe contains index data, row data, and a stripe footer.
The stripe footer holds a directory of stream locations, and the row data is what table scans actually read.
The index data includes the minimum and maximum values of each column and the row positions within each column. Row index entries record the offsets of the compressed blocks and of the bytes within the decompressed block. Note that the ORC index is used only to select stripes and row groups, not to answer queries. Having relatively frequent row index entries lets the reader skip rows for fast data access; by default, up to 10,000 rows can be skipped at a time. Because ORC can skip large numbers of rows based on filter predicates, you can sort a table on a secondary key to achieve a large reduction in query execution time. For example, if the primary partition key is the transaction date, the table can be sorted by province, zip code, or name; when querying records for one province, the records of all other provinces are skipped.
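As a hedged sketch of the sorting technique just described (the table and column names are invented, and the actual benefit depends on the Hive/ORC version and settings such as hive.optimize.index.filter):
create table orc_sorted (id int, name string, province string) stored as orc;
-- write the data sorted by province so each stripe/row group covers a narrow range of values
insert overwrite table orc_sorted
select id, name, province from raw_table
sort by province;
-- row-group min/max statistics on province let the reader skip non-matching row groups
select id, name from orc_sorted where province = 'Guangdong';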
The following describes how to use this storage format in Hive:
1) Supported data types
- Integer
  - Boolean (1 bit)
  - Tinyint (8 bit)
  - Smallint (16 bit)
  - Int (32 bit)
  - Bigint (64 bit)
- Floating point
- String types
- Binary blobs
- Date/time
- Compound types
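The list above maps directly onto Hive column types; a minimal sketch of a table that uses several of them (the table and column names are illustrative):
create table orc_types_demo (
  flag        boolean,
  age         tinyint,
  total       bigint,
  ratio       double,
  label       string,
  payload     binary,
  created_at  timestamp
) stored as orc;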
2) Hive DDL
Use the ORC storage format by specifying stored as orc:
create table orc_table (
  id int,
  name string
) stored as orc;
You can also change the storage format of an existing table:
alter table simple_table set fileformat orc;
Note that if simple_table already contains data, that data can no longer be read correctly through table queries after the change, because only the table metadata is updated and the existing files remain in the old format.
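Because of this, a common workaround (sketched here with an illustrative table name) is to rewrite the data into a new ORC table instead of altering the format in place:
-- create an ORC copy of the table and rewrite the existing rows into the new format
create table simple_table_orc stored as orc
as select * from simple_table;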
3) Specifying ORC storage format properties when creating a table
| KEY | DEFAULT | NOTES |
| --- | --- | --- |
| orc.compress | ZLIB | High-level compression codec: NONE, ZLIB, or SNAPPY |
| orc.compress.size | 262,144 | Number of bytes in each compression chunk |
| orc.stripe.size | 268,435,456 | Number of bytes in the memory buffer used for writing a stripe |
| orc.row.index.stride | 10,000 | Number of rows between index entries |
| orc.create.index | true | Whether to create row indexes |
| orc.bloom.filter.columns | "" | Comma-separated list of column names for which bloom filters should be created |
| orc.bloom.filter.fpp | 0.05 | False positive probability for bloom filters |
For example, create a table without compression:
create table orc_table (
  name string,
  age tinyint
) stored as orc tblproperties ("orc.compress" = "NONE");
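Several of these properties can be combined in one statement; the following is only an illustrative sketch (the values are examples, not tuning recommendations):
create table orc_table_snappy (
  id   int,
  name string
) stored as orc tblproperties (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "name",
  "orc.bloom.filter.fpp" = "0.05"
);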
4) Hive configuration parameters related to ORC storage files
· hive.default.fileformat
The storage file format for tables created by Hive. The default is TextFile.
· hive.exec.orc.default.compress
The default compression codec for ORC files. The default is ZLIB.
· hive.exec.orc.default.buffer.size
The default buffer size for ORC files. The default is 262,144 (256 KB).
· hive.exec.orc.default.block.size
The default system block size for ORC files. The default is 268,435,456 (256 MB).
· hive.exec.orc.zerocopy
Whether to use zero-copy reads for ORC files. Requires Hadoop 2.3 or later.
· hive.orc.compute.splits.num.threads
The number of threads ORC uses to compute splits in parallel.
· hive.exec.orc.skip.corrupt.data (default: false)
If the ORC reader encounters corrupt data, this value determines whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception.
· hive.merge.orcfile.stripe.level
When hive.merge.mapfiles, hive.merge.mapredfiles, or hive.merge.tezfiles is set to true and the table data is written in the ORC file format, setting this value to true enables fast, stripe-level merging of small ORC files.
· Other parameters are rarely used; refer to the Hive official documentation for configuration and tuning details.
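For illustration only (the values below are examples, not recommendations), a Hive session might set a few of these parameters before writing ORC data:
set hive.default.fileformat=ORC;
set hive.exec.orc.default.compress=SNAPPY;
set hive.exec.orc.zerocopy=true;
set hive.merge.orcfile.stripe.level=true;
-- with hive.default.fileformat=ORC, new tables default to the ORC storage format
create table orc_by_default (id int, name string);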