Several file formats for Hive

Source: Internet
Author: User

Hive File Storage Format
1.textfile
Textfile is the default format.
Storage mode: row storage.
Disk overhead is large, and so is the overhead of parsing the data.
Compressed text files (for example Gzip) cannot be merged or split by Hive.
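
Why the split point matters can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample rows and the helper names records_in_split and gzip_readable_from are made up here, and the split rule is a simplified version of what a line-oriented reader does, not Hive's or Hadoop's actual code. Plain text can be cut at any byte offset because a reader can skip to the next newline, while a gzip stream can only be decoded from its beginning.

```python
import gzip
import zlib

# Hypothetical sample data: 1000 tab-separated rows, one per line.
plain = b"".join(f"{i}\tuser_{i}\n".encode() for i in range(1000))
packed = gzip.compress(plain)


def records_in_split(data, start, end):
    """Return the lines that begin inside the byte range [start, end)."""
    pos = start
    if start > 0:
        # A line that began before `start` belongs to the previous split,
        # so skip ahead to the first position that follows a newline.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < min(end, len(data)):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        records.append(data[pos:newline])
        pos = newline + 1
    return records


def gzip_readable_from(data, offset):
    """Report whether a gzip decoder can start decoding at `offset`."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header.
        zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(data[offset:])
        return True
    except zlib.error:
        return False


middle = len(plain) // 2
first_half = records_in_split(plain, 0, middle)
second_half = records_in_split(plain, middle, len(plain))
```

Two workers given the two halves of the plain file recover every row exactly once between them, whereas gzip_readable_from(packed, len(packed) // 2) fails because the decoder finds no gzip header in the middle of the stream.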

2.sequencefile
Binary files in which records are serialized as <key,value> pairs.
Storage mode: row storage.
Splittable and compressible.
Block compression is the usual choice.
The advantage is that the files are compatible with the MapFile in the Hadoop API.
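
The reason block compression is usually preferred can be shown with a rough size comparison. The sketch below is not the SequenceFile format itself: it ignores record lengths, headers and sync markers, uses zlib instead of a Hadoop codec, and the sample records and function names are assumptions made for illustration. It only contrasts compressing each value on its own with compressing the keys and values of many records together.

```python
import zlib

# Hypothetical <key, value> records with repetitive values.
records = [
    (f"k{i:05d}".encode(), f"user_{i % 10},beijing,active".encode())
    for i in range(500)
]


def record_compressed_size(pairs):
    """Record style: each value is compressed on its own, keys stay raw."""
    return sum(len(key) + len(zlib.compress(value)) for key, value in pairs)


def block_compressed_size(pairs, records_per_block=100):
    """Block style: keys and values of many records are compressed together."""
    total = 0
    for start in range(0, len(pairs), records_per_block):
        block = pairs[start:start + records_per_block]
        keys = b"".join(key for key, _ in block)
        values = b"".join(value for _, value in block)
        total += len(zlib.compress(keys)) + len(zlib.compress(values))
    return total


raw_size = sum(len(key) + len(value) for key, value in records)
record_size = record_compressed_size(records)
block_size = block_compressed_size(records)
```

On data like this, block_size comes out well below both raw_size and record_size, because a compressor needs to see many similar records at once to find the redundancy between them.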


3.rcfile
Storage mode: the data is divided into blocks by row (row groups), and within each block it is stored by column.
Compresses well and gives fast column access.
Reading a record involves as few blocks as possible.
Reading the required columns only requires reading the header definition of each row group.
When reading the full data, performance may have no obvious advantage over sequencefile.
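
The "rows first, then columns" layout can be sketched with plain Python lists. This is a toy model under assumptions (the sample rows, the group size and the function names are invented here), not the rcfile on-disk format: it has no headers and no compression. It only shows that each row stays inside one row group while a query for a few columns never touches the others.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the chunk: one list per column.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def read_columns(groups, wanted):
    """Rebuild rows from the wanted column indexes only."""
    result = []
    for columns in groups:
        picked = [columns[index] for index in wanted]
        result.extend(zip(*picked))
    return result


# Hypothetical table: (id, name, age)
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
    (4, "dave", 37),
    (5, "eve", 29),
]
groups = to_row_groups(rows, group_size=2)
ids_and_ages = read_columns(groups, wanted=[0, 2])
```

Here groups[0] is [[1, 2], ["alice", "bob"], [30, 25]], and read_columns with wanted=[0, 2] returns the id and age of every row without reading the name column.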

4.orc

Storage mode: the data is divided into blocks by row, and within each block it is stored by column.
Compresses well and gives fast column access.
More efficient than rcfile; it is an improved version of rcfile.

5. Custom Format
Users can customize the input and output format by implementing InputFormat and OutputFormat.
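
Once such classes exist, a table is bound to them in its DDL with STORED AS INPUTFORMAT ... OUTPUTFORMAT .... The helper below just builds that statement as a string. The table, the columns and the class names com.example.hive.MyInputFormat and com.example.hive.MyOutputFormat are hypothetical placeholders, and a real table may additionally need a SerDe clause and the classes available to Hive.

```python
def custom_format_ddl(table, columns, input_format, output_format):
    """Build a CREATE TABLE statement that names custom format classes."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"STORED AS INPUTFORMAT '{input_format}' "
        f"OUTPUTFORMAT '{output_format}'"
    )


# Hypothetical table and class names, for illustration only.
ddl = custom_format_ddl(
    "events",
    [("id", "BIGINT"), ("payload", "STRING")],
    "com.example.hive.MyInputFormat",
    "com.example.hive.MyOutputFormat",
)
print(ddl)
```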


Summary:
Textfile: storage space consumption is relatively large, compressed text cannot be split or merged, and query efficiency is the lowest; data can be stored directly, so loading speed is the highest.
Sequencefile: storage space consumption is the largest; compressed files can be split and merged, and query efficiency is high; data has to be loaded by converting it from a text file.
Rcfile: storage space consumption is the smallest and query efficiency is the highest; data has to be loaded by converting it from a text file, and loading speed is the lowest.
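
Loading "through a text file" usually means landing the raw file in a textfile staging table and then rewriting it into the target format with an INSERT ... SELECT. The helper below generates the four HiveQL statements for that flow as strings. It is a sketch under assumptions: the function name, the _staging suffix, the tab delimiter and the sample path are all made up for illustration.

```python
BINARY_FORMATS = {"SEQUENCEFILE", "RCFILE", "ORC"}


def load_via_text_staging(target, columns, fmt, local_path):
    """Return HiveQL statements that load a text file into a `fmt` table."""
    if fmt not in BINARY_FORMATS:
        raise ValueError(f"unsupported target format: {fmt}")
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    staging = f"{target}_staging"  # hypothetical naming convention
    return [
        # 1. A textfile table that matches the raw file layout.
        f"CREATE TABLE {staging} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' STORED AS TEXTFILE",
        # 2. Text files can be loaded directly.
        f"LOAD DATA LOCAL INPATH '{local_path}' INTO TABLE {staging}",
        # 3. The real table in the target format.
        f"CREATE TABLE {target} ({cols}) STORED AS {fmt}",
        # 4. The conversion happens while the rows are rewritten.
        f"INSERT OVERWRITE TABLE {target} SELECT * FROM {staging}",
    ]


statements = load_via_text_staging(
    "users", [("id", "INT"), ("name", "STRING")], "ORC", "/tmp/users.tsv"
)
for statement in statements:
    print(statement + ";")
```

The extra rewrite in step 4 is the reason the non-text formats load more slowly than textfile.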

Personal advice: avoid textfile and sequencefile whenever you can; the best choice is orc.
