Hive data types and file storage formats

Source: Internet
Author: User

Hive data types

Primitive data types:

TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, DATE.

Complex data types:

These are ARRAY, MAP, STRUCT, and UNION, which are built up from the primitive types.

Array: an ARRAY is a sequence of elements of the same data type, accessed by a zero-based subscript. For example, if the column fruits holds the array ['Apple', 'orange', 'Mango'], then fruits[1] returns 'orange'.

Map: a MAP holds key->value pairs, accessed by key. For example, if userlist is a map whose keys are usernames and whose values are passwords, then userlist['username'] returns that user's password.
Struct: a STRUCT can contain fields of different data types, accessed with dot notation. For example, if user is a struct, the user's address is user.address.
Union: a UNIONTYPE value holds exactly one of its declared data types at any one time.
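
As a rough analogy only, the access patterns of the four complex types can be sketched with ordinary Python values. This is not Hive code: the sample entries, the User class, and the describe helper are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Union

# ARRAY analogy: same-typed elements, zero-based subscript (like fruits[1]).
fruits = ["Apple", "orange", "Mango"]

# MAP analogy: key -> value lookup (like userlist['username']).
userlist = {"alice": "s3cret"}  # hypothetical sample entry

# STRUCT analogy: named fields of different types, read with a dot.
@dataclass
class User:
    name: str
    age: int
    address: str

user = User(name="alice", age=30, address="12 Main St")

# UNIONTYPE analogy: a value that holds exactly one of several declared types.
IntOrStr = Union[int, str]

def describe(value: IntOrStr) -> str:
    """Report which of the declared types the value currently holds."""
    return f"{type(value).__name__}:{value}"

print(fruits[1])
print(userlist["alice"])
print(user.address)
print(describe(7), describe("seven"))
```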

Basic Type Conversions

Note: because the conversion table is wide, some long type names are abbreviated in it: TS for timestamp, BL for boolean, SL for smallint, DM for decimal, VC for varchar, and BA for binary.

Hive file formats

TextFile: text files

The default Hive format. Data is stored uncompressed, so disk usage and parsing cost are both high.
It can be combined with Gzip, Bzip2, Snappy, and similar codecs (Hive detects the compression and decompresses automatically during query execution). However, a file compressed with a non-splittable codec such as Gzip cannot be split by Hive, so the data in that file cannot be processed in parallel.
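
The splitting limitation can be illustrated outside Hive with a small sketch using Python's standard gzip module. The sample rows and the helper name can_decode_from are assumptions for illustration; the point is only that a Gzip stream can be decoded from its start but not from an arbitrary offset in the middle, which is what a second parallel reader would need.

```python
import gzip
import zlib

# Hypothetical sample data: 10,000 small tab-separated text rows.
rows = "".join(f"{i}\tuser{i}\n" for i in range(10_000)).encode()
blob = gzip.compress(rows)

def can_decode_from(data: bytes, offset: int) -> bool:
    """Return True if Gzip decompression can start at the given byte offset."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Without the stream header and earlier state, decoding fails.
        return False

print(can_decode_from(blob, 0))               # reader starting at the beginning
print(can_decode_from(blob, len(blob) // 2))  # "split" starting mid-file
```

The first call is expected to succeed and the second to fail, so only one task can read the whole file.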

SequenceFile: binary files

SequenceFile is a binary file format provided by the Hadoop API that serializes data into a file as (key, value) pairs.
Internally, the file is serialized and deserialized through Hadoop's standard Writable interface. It is compatible with MapFile in the Hadoop API.
Hive's SequenceFile inherits from the Hadoop API's SequenceFile, but it leaves the key empty and stores the actual row in the value, which avoids MapReduce sorting during the map phase.
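
The idea of (key, value) records with an empty key can be sketched as follows. This is a simplified, length-prefixed record format invented for illustration; it is not the real SequenceFile on-disk layout, and write_record and read_records are hypothetical helpers.

```python
import io
import struct

def write_record(out, key: bytes, value: bytes) -> None:
    """Write one record: 4-byte key length, 4-byte value length, then the bytes."""
    out.write(struct.pack(">II", len(key), len(value)))
    out.write(key)
    out.write(value)

def read_records(buf):
    """Yield (key, value) pairs until the buffer is exhausted."""
    while True:
        header = buf.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        key = buf.read(key_len)
        value = buf.read(value_len)
        yield key, value

buf = io.BytesIO()
for row in [b"1\talice", b"2\tbob"]:
    # As described above for Hive: empty key, the whole row in the value.
    write_record(buf, b"", row)

buf.seek(0)
records = list(read_records(buf))
print(records)
```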

RCFile:

RCFile is a column-oriented data format introduced specifically by Hive. It follows the design principle of "partition horizontally first, then vertically": rows are first divided into row groups, and each row group is then stored column by column.
During a query, RCFile skips the I/O for columns the query does not need. Note, however, that in the map phase RCFile still copies the entire data block from the remote node to the local directory. It does not jump straight to the required columns and bypass the rest; instead, it locates them by scanning the header of each row group.
The header at the level of the whole HDFS block does not record in which row group each column starts and ends. Therefore, when all columns are read, RCFile does not perform as well as SequenceFile.
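
A minimal sketch of this layout, assuming a toy in-memory format rather than the actual RCFile encoding: each row group stores its columns one after another and keeps a header of column byte lengths, which a reader scans to find the one column it wants. The names build_row_group and read_column and the sample table are hypothetical.

```python
from typing import Dict, List

def build_row_group(rows: List[List[str]]) -> Dict:
    """Store one row group column by column, with a header of column lengths.

    Values are joined with commas, so this toy format assumes no value contains one.
    """
    columns = [",".join(col).encode() for col in zip(*rows)]
    return {"header": [len(c) for c in columns], "body": b"".join(columns)}

def read_column(group: Dict, index: int) -> List[str]:
    """Use the row-group header to locate one column and skip the others."""
    start = sum(group["header"][:index])
    end = start + group["header"][index]
    return group["body"][start:end].decode().split(",")

table = [
    ["1", "alice", "NY"],
    ["2", "bob", "LA"],
    ["3", "carol", "SF"],
    ["4", "dave", "TX"],
]

# Horizontal partition first: two rows per row group.
groups = [build_row_group(table[i:i + 2]) for i in range(0, len(table), 2)]

# Vertical access second: read only column 1 from every row group.
names = [name for g in groups for name in read_column(g, 1)]
print(names)
```

Note that the reader still has to visit the header of every row group, which matches the behavior described above.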

Row storage, column storage, and RCFile storage

Row storage
Example of row storage in an HDFS block

The advantages of a Hadoop-based row storage layout are fast data loading and strong adaptability to dynamic workloads, because row storage guarantees that all fields of a record are on the same cluster node, that is, in the same HDFS block.
The drawbacks of row storage are equally clear. It cannot support fast query processing, because it cannot skip unneeded columns when a query touches only a few columns of a wide table. In addition, because columns with different kinds of values are mixed together, row storage does not easily achieve a high compression ratio, so space utilization is hard to improve significantly.

Column storage
Example of column storage in HDFS blocks

The example stores a table on HDFS by column group. Columns A and B are stored in the same column group, while columns C and D are each stored in their own column group. At query time, column storage avoids reading unneeded columns, and compressing the similar values within a column achieves a high compression ratio.
However, because of the high cost of tuple reconstruction, it cannot provide fast query processing on a Hadoop-based system. Column storage does not guarantee that all fields of a record are stored on the same cluster node; in the column storage example, the 4 fields of a record are stored in 3 HDFS blocks on different nodes.
Therefore, reconstructing records causes large amounts of data to be transferred over the cluster network. Pre-grouping several columns together can reduce this overhead, but it adapts poorly to highly dynamic workloads.
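
The trade-off between the two layouts can be made concrete with a small in-memory sketch. The table contents and the helper names are assumptions for illustration, and byte counts stand in for I/O: a single-column query reads far less under the column layout, while rebuilding one full record has to touch every column store.

```python
# Hypothetical 4-column table with 1,000 records.
rows = [(i, f"user{i}", "NY" if i % 2 else "LA", i * 10) for i in range(1000)]

# Row layout: one byte string per record, all fields stored together.
row_store = ["\t".join(map(str, r)).encode() for r in rows]

# Column layout: one byte string per column (each could sit in a different block).
column_store = ["\n".join(str(r[c]) for r in rows).encode() for c in range(4)]

def bytes_read_row_layout() -> int:
    """Selecting any single column still scans every full record."""
    return sum(len(rec) for rec in row_store)

def bytes_read_column_layout(col: int) -> int:
    """Only the requested column is read."""
    return len(column_store[col])

def stores_touched_to_rebuild_record() -> int:
    """Rebuilding one full record needs a piece from every column store."""
    return len(column_store)

print(bytes_read_row_layout(), bytes_read_column_layout(2))
print(stores_touched_to_rebuild_record())
```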

RCFile storage
Example of RCFile storage in an HDFS block

RCFile combines the fast queries of row storage with the space savings of column storage.
First, RCFile guarantees that all data in a row is on the same node, so the cost of tuple reconstruction is low. Second, like column storage, RCFile can exploit column-wise data compression and skip reads of unneeded columns.

Extension interfaces

Default file reading method
Custom InputFormat
Custom SerDe

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
