Hive File Storage Format
1. TextFile
TextFile is the default format.
Storage mode: row-oriented
Disk overhead is large, and so is the cost of parsing the data.
Compressed text files (for example, gzip) cannot be split or merged by Hive.
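
The splitting limitation can be illustrated outside of Hive. The following is a minimal sketch, assuming gzip as the codec and made-up tab-separated rows: a reader can start in the middle of plain text and resynchronize at the next newline, but it cannot start decoding a gzip stream at an arbitrary offset.

```python
import gzip
import zlib

# Build a small "table" of tab-separated rows (made-up data) and gzip it as one stream.
rows = [f"{i}\tuser_{i}\t{i * 10}" for i in range(1000)]
plain = ("\n".join(rows) + "\n").encode("utf-8")
compressed = gzip.compress(plain)


def can_decode_from(offset: int) -> bool:
    """Return True if gzip decompression succeeds when started at this byte offset."""
    try:
        gzip.decompress(compressed[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Plain text: jump to the middle, skip to the next newline, and read whole rows from there.
middle = len(plain) // 2
start = plain.index(b"\n", middle) + 1
second_half_rows = plain[start:].decode("utf-8").splitlines()

# Compressed text: only the start of the stream is a valid entry point.
whole_ok = can_decode_from(0)
middle_ok = can_decode_from(len(compressed) // 2)
```

This is why a single large compressed text file ends up being processed by one task instead of being divided among several.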
2. SequenceFile
A binary file format; records are serialized into the file as <key, value> pairs.
Storage mode: row-oriented
Splittable and compressible.
Block compression is the usual choice.
The advantage is that the files are compatible with MapFile in the Hadoop API.
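
To make the <key, value> idea concrete, here is a toy sketch of length-prefixed binary records. It is not the real SequenceFile layout (which also has a file header, sync markers, and optional compression); the function names and sample records are assumptions made for illustration only.

```python
import io
import struct


def write_records(stream, records):
    """Write each (key, value) pair as two length-prefixed byte strings."""
    for key, value in records:
        for field in (key, value):
            stream.write(struct.pack(">I", len(field)))  # 4-byte big-endian length
            stream.write(field)


def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return
        (key_len,) = struct.unpack(">I", header)
        key = stream.read(key_len)
        (value_len,) = struct.unpack(">I", stream.read(4))
        value = stream.read(value_len)
        yield key, value


# Round-trip two made-up records through an in-memory file.
buf = io.BytesIO()
pairs = [(b"1", b"alice\t30"), (b"2", b"bob\t25")]
write_records(buf, pairs)
buf.seek(0)
decoded = list(read_records(buf))
```

Because each record carries its own lengths, the values can hold arbitrary bytes, which is what a text file with line and field delimiters cannot do safely.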
3. RCFile
Storage mode: data is divided into blocks by row (row groups), and each block is stored by column.
Fast compression and fast column access.
Reading a record touches as few blocks as possible.
To read only the required columns, Hive only needs the header definition of each row group to locate them.
When reading the full data set, performance may show no clear advantage over SequenceFile.
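
The "rows into blocks, blocks into columns" layout can be sketched in a few lines. This is only an in-memory illustration, assuming a made-up three-column table and a row group size of 2; the real format works on bytes on disk, with a header per row group.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str, int]
COLUMNS = ("id", "name", "age")  # hypothetical schema


def to_row_groups(rows: List[Row], group_size: int) -> List[Dict[str, list]]:
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the rows of this group into columns.
        groups.append({name: list(col) for name, col in zip(COLUMNS, zip(*chunk))})
    return groups


def read_column(groups: List[Dict[str, list]], name: str) -> list:
    """Read one column across all row groups without touching the other columns."""
    values = []
    for group in groups:
        values.extend(group[name])
    return values


rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41), (4, "dave", 19), (5, "erin", 33)]
groups = to_row_groups(rows, group_size=2)
ages = read_column(groups, "age")
```

All fields of one record stay inside a single row group, so rebuilding a row involves one block, while a query on `age` alone never reads the `name` values.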
4. ORC
Storage mode: data is divided into blocks by row, and each block is stored by column.
Fast compression and fast column access.
More efficient than RCFile; it is an improved version of RCFile.
5. Custom Format
Users can customize the input and output format by implementing InputFormat and OutputFormat.
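
Once such classes exist, a table can point at them in its DDL with STORED AS INPUTFORMAT ... OUTPUTFORMAT .... The sketch below only builds that statement as a string; the helper name, the table, and the class names under com.example are hypothetical placeholders, not real classes.

```python
def custom_format_ddl(table: str, columns: dict, input_format: str, output_format: str) -> str:
    """Build a CREATE TABLE statement that names custom InputFormat/OutputFormat classes."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE TABLE {table} ({cols})\n"
        f"STORED AS INPUTFORMAT '{input_format}'\n"
        f"OUTPUTFORMAT '{output_format}'"
    )


# Hypothetical table and class names, for illustration only.
ddl = custom_format_ddl(
    "events",
    {"id": "INT", "payload": "STRING"},
    "com.example.hive.MyInputFormat",
    "com.example.hive.MyOutputFormat",
)
print(ddl)
```

The classes themselves would be written in Java against the Hadoop API and placed on Hive's classpath before the table is queried.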
Summary:
TextFile: storage space consumption is large, compressed text cannot be split or merged, and query efficiency is the lowest; data can be stored directly, so loading is the fastest.
SequenceFile: storage space consumption is the largest; compressed files can be split and merged, and query efficiency is high; data has to be loaded by converting it from a text file.
RCFile: storage space consumption is the smallest and query efficiency is the highest; data has to be loaded by converting it from a text file, and loading is the slowest.
Personal advice: avoid TextFile and SequenceFile whenever you can; the best choice is ORC.
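
Loading "through a text file" usually means: load the raw file into a TextFile staging table, then copy it into the target table with INSERT ... SELECT. Here is a minimal sketch that generates those statements, assuming tab-delimited input; the helper name, the `_staging` suffix, the table, and the path are all hypothetical.

```python
ALLOWED_FORMATS = {"TEXTFILE", "SEQUENCEFILE", "RCFILE", "ORC"}


def load_via_text_staging(target: str, columns: dict, local_path: str,
                          target_format: str = "ORC") -> list:
    """Return the HiveQL statements that load a text file into a table of another format."""
    if target_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {target_format}")
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    staging = f"{target}_staging"  # hypothetical naming convention
    return [
        # 1. A plain text staging table that matches the raw file.
        f"CREATE TABLE {staging} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' STORED AS TEXTFILE",
        # 2. Loading into a TextFile table just moves the file into place.
        f"LOAD DATA LOCAL INPATH '{local_path}' INTO TABLE {staging}",
        # 3. The real table, in the chosen format.
        f"CREATE TABLE {target} ({cols}) STORED AS {target_format}",
        # 4. The conversion happens here, as a query over the staging table.
        f"INSERT OVERWRITE TABLE {target} SELECT * FROM {staging}",
    ]


statements = load_via_text_staging("users", {"id": "INT", "name": "STRING"}, "/tmp/users.tsv")
for stmt in statements:
    print(stmt + ";")
```

Step 4 is where the extra loading cost of the non-text formats comes from: the data is rewritten by a job instead of being copied as-is.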