Hive Learning Path (VI): Data Types and Storage Formats in Hive SQL

Source: Internet
Author: User

I. Data types

1. Basic data types

Hive supports most of the basic data types found in relational databases.

Type        Description                                               Example
boolean     true/false                                                true
tinyint     1-byte signed integer, -128 to 127                        1Y
smallint    2-byte signed integer, -32768 to 32767                    1S
int         4-byte signed integer                                     1
bigint      8-byte signed integer                                     1L
float       4-byte single-precision floating-point number             1.0
double      8-byte double-precision floating-point number             1.0
decimal     signed decimal with arbitrary precision                   1.0
string      variable-length string                                    "A", 'B'
varchar     variable-length string with a declared maximum length     "A", 'B'
char        fixed-length string                                       "A", 'B'
binary      byte array                                                (no literal form)
timestamp   timestamp, nanosecond precision                           122327493795
date        date                                                      '2018-04-07'

As in other SQL dialects, these type names are reserved words. Note that all of these data types are implemented on top of Java types, so their detailed behavior matches that of the corresponding Java types. For example, string is implemented with Java's String, float with Java's float, and so on.
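The integer ranges listed in the table follow directly from the byte widths. As a minimal sketch (written in Python rather than HiveQL, with `signed_range` being a helper name chosen here for illustration), an n-byte two's-complement signed integer spans -2^(8n-1) to 2^(8n-1)-1:

```python
# Byte widths of Hive's integer types, taken from the table above.
INT_WIDTHS = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}

def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of this width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

for name, width in INT_WIDTHS.items():
    low, high = signed_range(width)
    print(f"{name:<8} {low} .. {high}")
```

A literal that falls outside the range of its declared type cannot be stored in that type, which is why choosing the width up front matters.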

2. Complex types

Type     Description                                                       Example
array    ordered collection of elements of the same type                  array(1, 2)
map      key-value pairs; keys must be primitive types, values any type   map('A', 1, 'B', 2)
struct   collection of fields, which may have different types             struct('1', 1, 1.0), named_struct('col1', '1', 'col2', 1, 'col3', 1.0)

II. Storage formats

Hive creates a directory on HDFS for each database. Each table is stored as a subdirectory of its database directory, and the table's data is stored as files in the table directory. The default database has no directory of its own; its tables are stored directly under the /user/hive/warehouse directory.
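The directory layout can be sketched as a small path-building function. This is an illustrative Python sketch, not Hive code: `table_location` is a hypothetical helper, and it assumes the default warehouse root `/user/hive/warehouse` and the convention of naming a database directory `<database>.db`. Both can differ in a real deployment.

```python
import posixpath

WAREHOUSE_ROOT = "/user/hive/warehouse"  # assumed default warehouse directory

def table_location(database, table, root=WAREHOUSE_ROOT):
    """Return the HDFS directory where a table's data files would live."""
    if database == "default":
        # Tables in the default database sit directly under the warehouse root.
        return posixpath.join(root, table)
    # Other databases get their own "<name>.db" directory (assumed convention).
    return posixpath.join(root, database + ".db", table)

print(table_location("default", "logs"))
print(table_location("sales", "orders"))
```

The database and table names here are made up; the point is only that the default database adds no path component of its own.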

(1) TextFile

TextFile is the default format and stores data row by row. The data is not compressed, so disk usage is high and parsing is expensive.

(2) SequenceFile

SequenceFile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible.

SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD compression achieves a low compression ratio, so BLOCK compression is generally recommended.
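The gap between record-level and block-level compression can be illustrated without Hadoop. The sketch below uses Python's `zlib` as a stand-in codec and made-up records; it does not reproduce the SequenceFile format itself. It only shows why compressing many records together tends to beat compressing each one separately.

```python
import zlib

# Hypothetical sample records; real SequenceFile records are key/value pairs.
records = [f"user_{i % 50},click,2018-04-07".encode() for i in range(1000)]

raw_total = sum(len(r) for r in records)

# RECORD-style: each record is compressed on its own, so redundancy
# shared between records cannot be exploited.
record_total = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: many records are batched and compressed together.
block_total = len(zlib.compress(b"\n".join(records)))

print("raw:", raw_total, "record:", record_total, "block:", block_total)
```

With short records, per-record compression can even exceed the raw size because every compressed record carries its own header and checksum.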

(3) RCFile

RCFile combines row and column storage: rows are first split into groups, and each group is then stored column by column.
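One way to picture a hybrid row/column layout is to split the rows horizontally into groups and then store each group column by column. The following is a conceptual Python sketch, not RCFile's actual on-disk format; `to_row_groups` and the sample rows are invented for illustration.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups, then lay out each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        # zip(*group) transposes the rows of this group into columns.
        groups.append([list(column) for column in zip(*group)])
    return groups

rows = [
    (1, "a", 1.0),
    (2, "b", 2.0),
    (3, "c", 3.0),
    (4, "d", 4.0),
]
print(to_row_groups(rows, group_size=2))
# → [[[1, 2], ['a', 'b'], [1.0, 2.0]], [[3, 4], ['c', 'd'], [3.0, 4.0]]]
```

All values of one row stay in the same group, while values of the same column sit next to each other within the group, which helps compression and column-only reads.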

(4) ORCFile

ORCFile divides data into blocks of rows, stores each block column by column, and keeps an index with each block. It is a newer format introduced by Hive as an upgraded version of RCFile, with significantly better performance; the data can be compressed and still accessed quickly.
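The per-block index is what makes quick access possible: a reader can consult it first and skip blocks that cannot contain the requested values. Here is a minimal Python sketch, assuming the index holds the minimum and maximum of a column for each block. That is a simplification of what ORC stores, and the function names and data are hypothetical.

```python
def build_index(blocks, column):
    """Record the (min, max) of one column for each block."""
    return [(min(block[column]), max(block[column])) for block in blocks]

def blocks_to_read(index, wanted):
    """Return positions of blocks whose value range could contain `wanted`."""
    return [i for i, (low, high) in enumerate(index) if low <= wanted <= high]

# Each block is a list of columns; column 0 is a hypothetical id column.
blocks = [
    [[1, 2, 3], ["a", "b", "c"]],
    [[4, 5, 6], ["d", "e", "f"]],
    [[7, 8, 9], ["g", "h", "i"]],
]
index = build_index(blocks, column=0)
print(index)                            # → [(1, 3), (4, 6), (7, 9)]
print(blocks_to_read(index, wanted=5))  # → [1]
```

In this example a lookup for id 5 touches one block out of three; the other two are skipped without being read or decompressed.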

(5) Parquet

Parquet is also a columnar store. It compresses well and reduces the time needed to scan and deserialize large tables.

III. Data format

When data is stored in text files, rows and columns must be separated by delimiters, and Hive has to be told which delimiters are in use. By default Hive uses several control characters that rarely occur in record content.

The default row and column separators for Hive are shown in the following table.

Delimiter      Description
\n             For text files, each line is a record, so \n separates records.
^A (Ctrl+A)    Separates fields (columns); can also be written as \001.
^B (Ctrl+B)    Separates elements in an array or struct, or key-value pairs in a map; can also be written as \002.
^C (Ctrl+C)    Separates a key from its value within a map entry; can also be written as \003.
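To make the nesting of these delimiters concrete, the Python sketch below builds and parses one text record by hand. The schema (a string field, an array field, and a map field with integer values) and all sample values are assumptions made for illustration; Hive's own text parsing is not shown here.

```python
FIELD_SEP = "\x01"      # ^A: between fields
ITEM_SEP = "\x02"       # ^B: between array/struct elements and between map entries
KEY_VALUE_SEP = "\x03"  # ^C: between a map key and its value

def parse_line(line):
    """Parse one record with a hypothetical schema: name, hobbies array, scores map."""
    name, hobbies, scores = line.rstrip("\n").split(FIELD_SEP)
    entries = (entry.split(KEY_VALUE_SEP) for entry in scores.split(ITEM_SEP))
    return {
        "name": name,
        "hobbies": hobbies.split(ITEM_SEP),
        "scores": {key: int(value) for key, value in entries},
    }

# Build a sample record the same way it would appear in a text file.
hobbies_text = ITEM_SEP.join(["reading", "chess"])
scores_text = ITEM_SEP.join(["math" + KEY_VALUE_SEP + "90", "art" + KEY_VALUE_SEP + "85"])
line = FIELD_SEP.join(["alice", hobbies_text, scores_text]) + "\n"

print(parse_line(line))
# → {'name': 'alice', 'hobbies': ['reading', 'chess'], 'scores': {'math': 90, 'art': 85}}
```

Because each level of nesting has its own delimiter, the record can be split unambiguously as long as none of these control characters appears in the data itself.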
