Hive-based file format: RCFile introduction and application


Contents

1. Introduction to Hadoop file formats
(1) SequenceFile
(2) RCFile
(3) Avro
(4) Text Format
(5) External format
2. Why is RCFile required?
3. RCFile Introduction
4. What should be used after RCFile?
5. How to generate an RCFile file
(1) Insert from a TextFile table directly in Hive
(2) Generate through MapReduce
6. References

As an open-source MapReduce implementation, Hadoop has always parsed file formats dynamically at run time, which gives it loading speeds several times faster than those of MPP databases. However, the MPP database community has long criticized Hadoop for the high serialization and deserialization costs it pays, precisely because its file formats are not designed for a specific purpose.

 

1. Introduction to Hadoop file formats

Currently, the popular file formats in Hadoop are as follows:

 

(1) SequenceFile

SequenceFile is a binary file format provided by the Hadoop API. It serializes data to a file as <key, value> pairs, using Hadoop's standard Writable interface for serialization and deserialization, and it is compatible with MapFile in the Hadoop API. The SequenceFile used by Hive inherits from the Hadoop API's SequenceFile, except that its key is left empty and the actual row is stored in the value, which avoids sorting on the key during the map stage of MR. If you write SequenceFiles with the Java API and want Hive to be able to read them, make sure the data is stored in the value field; otherwise, you will need to customize the InputFormat and OutputFormat classes that read and write such SequenceFiles.
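
A minimal sketch of that convention, assuming a Hadoop classpath (the output path and row layout here are hypothetical): the snippet writes a SequenceFile with an empty key so that Hive can read each row from the value field.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class HiveSequenceFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical file under a Hive table's HDFS directory
            Path path = new Path("/tmp/hive_seq_table/part-00000");
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(BytesWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                BytesWritable emptyKey = new BytesWritable(); // key stays empty; Hive reads only the value
                // One table row, columns separated by Hive's default ^A (\u0001) delimiter
                writer.append(emptyKey, new Text("1\u00012014-01-01\u0001100.0"));
            }
        }
    }

Keeping the row in the value rather than the key is exactly what avoids the map-stage sort described above.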


 

(2) RCFile

RCFile is a column-oriented data format introduced by Hive. It follows the design philosophy of "partition horizontally first, then vertically": the data is first split into row groups, and within each row group it is stored column by column. When a query does not touch certain columns, their I/O can be skipped within each row group. Note, however, that in the map stage the entire HDFS block is still copied from the remote node. Once the block is local, RCFile does not jump straight to the wanted columns while skipping the rest; it finds them by scanning the header of each row group, because there is no header at the HDFS block level recording in which row group each column starts and ends. As a result, when all columns are read, RCFile performs no better than SequenceFile.


[Figure: example of row-based storage in HDFS blocks]

[Figure: example of column-based storage in HDFS blocks]

[Figure: example of RCFile storage in HDFS blocks]
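
To make the row-group scanning concrete, here is a hedged Java sketch of reading only two columns from an RCFile. The file path is hypothetical, and ColumnProjectionUtils.setReadColumnIDs is the column-projection hook found in older Hive releases; it must be set on the configuration before the reader is created.

    import java.util.ArrayList;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.RCFile;
    import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
    import org.apache.hadoop.io.LongWritable;

    public class RCFileColumnRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Materialize only columns 0 and 2; the reader skips the others per row group
            ColumnProjectionUtils.setReadColumnIDs(conf, new ArrayList<>(Arrays.asList(0, 2)));
            Path path = new Path("/tmp/rc_table/part-00000"); // hypothetical path
            RCFile.Reader reader = new RCFile.Reader(FileSystem.get(conf), path, conf);
            LongWritable rowId = new LongWritable();
            BytesRefArrayWritable row = new BytesRefArrayWritable();
            while (reader.next(rowId)) {
                reader.getCurrentRow(row);
                // row.get(0) and row.get(2) hold the projected columns' raw bytes
            }
            reader.close();
        }
    }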

 

(3) Avro

Avro is a binary file format designed for data-intensive applications. Its files are more compact, and when large amounts of data are read, Avro delivers better serialization and deserialization performance. Avro data files also carry their schema with them, so developers do not need to implement their own Writable objects at the API level. A number of Hadoop subprojects have recently added support for the Avro data format, including Pig, Hive, Flume, Sqoop, and HCatalog.
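
As an illustration of the self-describing schema (the record schema and output path below are made up for the example), this Java snippet writes an Avro data file; the schema is embedded in the file itself, which is why no hand-written Writable is needed to read it back.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteExample {
        // Schema for a hypothetical "Sale" record
        private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Sale\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", 1L);
            rec.put("amount", 100.0);
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("/tmp/sales.avro")); // schema goes into the file header
                writer.append(rec);
            }
        }
    }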


 

(4) Text Format

In addition to the three binary formats above, text-format data is also frequently encountered in Hadoop, for example TextFile, XML, and JSON. Besides occupying more disk space, text generally costs dozens of times more to parse than a binary format, and XML and JSON are even more expensive to parse than TextFile, so using these formats for storage in a production system is strongly discouraged; if they must be output, perform the conversion on the client side. The text format is still commonly used for log collection and database import, and Hive's default configuration also uses it, where it is easy to forget to enable compression, so make sure the correct format is in use. Another disadvantage of the text format is that it carries no types or schema. Data such as sales amounts, profits, or dates and times, when saved as text, cannot be sorted correctly by MR because the string representations differ in length or contain negative numbers, as shown below. Such data therefore often has to be preprocessed into a schema-bearing binary format, which adds an unnecessary preprocessing step and wastes storage.
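
A minimal Java illustration of that sorting pitfall: MR compares Text keys byte by byte, so numbers stored as text come out in lexicographic rather than numeric order.

    import java.util.Arrays;

    public class LexVsNumericSort {
        public static void main(String[] args) {
            String[] amounts = {"9", "10", "-1", "-9"};
            // Byte-wise (lexicographic) order, the way MR sorts Text keys
            Arrays.sort(amounts);
            // Prints [-1, -9, 10, 9]; numeric order would be -9, -1, 9, 10
            System.out.println(Arrays.toString(amounts));
        }
    }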

 

(5) External format

Hadoop can in fact support any file format, as long as corresponding RecordWriter and RecordReader implementations are provided. Database-style stores are also often accessed from Hadoop, such as HBase, MySQL, Cassandra, and MongoDB, generally to avoid large data movements and to achieve fast loading. With these stores, serialization and deserialization are handled by the database clients, the storage location and data layout of the files are not controlled by Hadoop, and their splits are not based on the HDFS block size.
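
As a rough sketch of what such an implementation involves (the class below is purely illustrative, not a real connector), a custom RecordReader for the new mapreduce API only has to fill in a handful of methods:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Skeleton of a custom RecordReader for an external store; a real one
    // would open a connection or cursor in initialize() and pull one
    // record per nextKeyValue() call.
    public class ExternalStoreRecordReader extends RecordReader<LongWritable, Text> {
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            // open the connection to the source described by the split
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // fetch the next record, fill key/value, and return true;
            // return false once the source is exhausted
            return false;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() { return 0f; }

        @Override
        public void close() throws IOException {
            // release the connection
        }
    }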

