Hive file formats: an introduction to RCFile and its application


Reprinted from: https://my.oschina.net/leejun2005/blog/280896

As an open-source implementation of MapReduce, Hadoop has always counted among its advantages the ability to parse arbitrary file formats dynamically at run time and to load data several times faster than an MPP database. The MPP database community, however, has long criticized Hadoop on the grounds that its file formats are not purpose-built, so the cost of serialization and deserialization is too high.

1. Introduction to Hadoop file formats

Several file formats are prevalent in Hadoop today:

(1) SequenceFile

SequenceFile is a binary file format provided by the Hadoop API that serializes data into a file as <key, value> pairs. Internally, the file is serialized and deserialized with Hadoop's standard Writable interface, and it is compatible with MapFile in the Hadoop API. The SequenceFile used by Hive inherits from the Hadoop API's SequenceFile, but its key is left empty and the actual row is held in the value, which avoids sorting by key during the map phase of a MapReduce job. If you write a SequenceFile with the Java API and want Hive to read it, be sure to put the data in the value field; otherwise you will have to write custom InputFormat and OutputFormat classes to read that SequenceFile.
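
Below is a minimal sketch (not from the original post) of writing such a Hive-readable SequenceFile with the Java API: the key is an empty BytesWritable that Hive ignores, and each record goes into a Text value whose fields are joined with Hive's default field delimiter '\001'. The class name, output path, and sample record are illustrative assumptions only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteHiveSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);   // e.g. a file under the table's HDFS directory

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, BytesWritable.class, Text.class);
        try {
            BytesWritable emptyKey = new BytesWritable();       // key is ignored by Hive
            Text value = new Text("1\00120131001\001100.5");    // fields joined by '\001'
            writer.append(emptyKey, value);
        } finally {
            writer.close();
        }
    }
}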

(2) RCFile

RCFile is a column-oriented data format introduced specifically by Hive. It follows the design of "partition horizontally first, then vertically": a table is first divided into row groups, and within each row group the data is laid out column by column. During a query, RCFile skips the I/O for columns the query does not care about. Note, however, that in the map phase RCFile still copies the entire data block from the remote node. After the block has been copied to the local directory, RCFile does not literally skip the unneeded columns and jump straight to the columns to be read; instead, column skipping is achieved by scanning the header of each row group, and the header at the level of the whole HDFS block does not record in which row group each column starts and ends. Consequently, when all columns must be read, RCFile does not perform better than SequenceFile.

Example of row-oriented storage inside an HDFS block

Example of column-oriented storage inside an HDFS block

Example of RCFile storage inside an HDFS block

(3) Avro

Avro is a binary file format designed to support data-intensive applications. Its file format is compact, and Avro offers better serialization and deserialization performance when large volumes of data are read. Avro data files also carry their schema with them, so developers do not need to implement their own Writable objects at the API level. A growing number of Hadoop sub-projects support the Avro data format, including Pig, Hive, Flume, Sqoop, and HCatalog.

(4) Text Format

In addition to the three binary formats above, text-format data is frequently encountered in Hadoop, for example TextFile, XML, and JSON. Besides consuming more disk space, text is typically dozens of times more expensive to parse than a binary format; XML and JSON in particular are even more costly to parse than TextFile, so it is strongly recommended not to use these formats for storage in a production system. If such formats must be produced, do the conversion on the client side. Text formats are commonly used for log collection and database imports, and Hive's default configuration also uses a text format; it is easy to forget to enable compression, so make sure the correct settings are in place. Another drawback of text is that it carries no types and no schema. Data such as sales amounts, profit figures, or date/time values, when stored as text, have varying string lengths and may contain negative numbers, so MapReduce cannot sort them correctly as plain strings. They therefore usually have to be preprocessed into a binary format that carries a schema, which adds an unnecessary preprocessing step and wastes storage resources.

(5) External formats

Hadoop can in fact support any file format, as long as a corresponding RecordWriter and RecordReader are implemented. Database-style formats are also often stored in Hadoop, for example HBase, MySQL, Cassandra, and MongoDB. These are typically used to avoid large-scale data movement and to allow fast loading. Their serialization and deserialization are handled by the clients of those database formats, the storage location and data layout of the files are not controlled by Hadoop, and their files are not split along the HDFS block size (blocksize).

2. Why RCFile is needed

Facebook introduced its data warehouse Hive at the ICDE (IEEE International Conference on Data Engineering). Hive stores massive amounts of data on a Hadoop system and provides database-like data storage and processing mechanisms. It uses an SQL-like language to manage and process data automatically: statements are parsed and transformed into Hadoop MapReduce jobs, and executing those jobs completes the data processing. The original post includes a figure showing the system architecture of the Hive data warehouse.

The storage-scalability challenge Facebook faces in its data warehouse is unique. It stores more than 300 PB of data in a Hive-based warehouse, growing by roughly 600 TB of new data per day; the amount of data stored in this warehouse tripled over the past year. Given this growth trend, storage efficiency is, now and in the future, one of the most important concerns for Facebook's data warehouse infrastructure. The paper "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems", published by Facebook engineers, describes an efficient data placement structure, RCFile (Record Columnar File), and its application in Facebook's data warehouse Hive. Compared with the storage structures of traditional databases, RCFile better satisfies the four key requirements of a MapReduce-based data warehouse: fast data loading, fast query processing, highly efficient use of storage space, and strong adaptability to highly dynamic workload patterns. RCFile is widely used in Hive, Facebook's data-analysis system. First, RCFile matches row storage in data-loading speed and load adaptability; second, RCFile's read optimization avoids unnecessary column reads when scanning a table, and tests show that in most cases it outperforms the other structures; third, RCFile compresses data along the column dimension, which effectively improves storage-space utilization.
To improve storage-space utilization, data generated by Facebook's production applications has been stored in the RCFile structure since 2010, and data sets previously stored in row-oriented structures (SequenceFile/TextFile) have also been converted to RCFile. In addition, Yahoo has integrated RCFile into the Pig data-analysis system, and RCFile is being used in another Hadoop-based data management system, Howl (http://wiki.apache.org/pig/Howl). Furthermore, according to exchanges in the Hive development community, RCFile has been successfully integrated into other MapReduce-based data-analysis platforms. It is reasonable to believe that RCFile, as a de facto data storage standard, will continue to play an important role in large-scale data analysis in MapReduce environments.

3. Introduction to RCFile

When data in the Facebook data warehouse is loaded into tables, the first storage format applied is Facebook's own Record Columnar File format (RCFile). RCFile is a hybrid columnar storage format: it still allows row-wise access while providing the compression efficiency of column storage. Its core idea is to first partition a Hive table horizontally into row groups and then, within each row group, partition the data vertically by column, so that the values of each column are stored as contiguous blocks on disk.
When all the columns of a row group have been written out, RCFile compresses the data column by column with an algorithm such as zlib or LZO. When column data is read, a lazy decompression policy is applied: if a user's query touches only a subset of a table's columns, RCFile skips the decompression and deserialization of the columns that are not needed. In a representative experiment chosen from the Facebook data warehouse, RCFile achieved a compression ratio of about 5x.
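
To illustrate how a MapReduce job can benefit from this column skipping, here is a minimal read-side sketch (not from the original post). It assumes the HCatalog class RCFileMapReduceInputFormat that also appears in the loader code in section 5 below, and a Hive version that exposes ColumnProjectionUtils.setReadColumnIDs(); only columns 0 and 2 are projected, so the remaining columns are never decompressed. Class names and paths are illustrative.

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hcatalog.rcfile.RCFileMapReduceInputFormat;

public class RCFileColumnScan {

    public static class ReadMapper
            extends Mapper<LongWritable, BytesRefArrayWritable, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, BytesRefArrayWritable row, Context context)
                throws IOException, InterruptedException {
            // Only the projected columns carry real data; the others stay compressed/empty.
            BytesRefWritable c0 = row.get(0);
            BytesRefWritable c2 = row.get(2);
            String out = new String(c0.getData(), c0.getStart(), c0.getLength(), "UTF-8")
                    + "\t" + new String(c2.getData(), c2.getStart(), c2.getLength(), "UTF-8");
            context.write(NullWritable.get(), new Text(out));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        ArrayList<Integer> readCols = new ArrayList<Integer>();
        readCols.add(0);
        readCols.add(2);
        ColumnProjectionUtils.setReadColumnIDs(conf, readCols);   // project columns 0 and 2

        Job job = new Job(conf, "RCFile column scan");
        job.setJarByClass(RCFileColumnScan.class);
        job.setMapperClass(ReadMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(RCFileMapReduceInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}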

4. Beyond RCFile: what comes next?

As the amount of data stored in the warehouse keeps growing, engineers in the Facebook team began to study techniques for further improving compression efficiency. The focus of that work is column-level encodings, such as run-length encoding, dictionary encoding, and frame-of-reference encoding. These are lightweight numeric encodings that can remove logical redundancy at the column level before the general-purpose compression step runs. Facebook has also experimented with new column types (for example, JSON is widely used inside Facebook; storing JSON data in a structured way both satisfies the demand for efficient queries and reduces the storage redundancy of JSON metadata). Experiments show that, applied appropriately, column-level encodings can significantly improve RCFile's compression ratio.
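
To make the idea concrete, here is a toy sketch (not from the original post) of run-length encoding applied to a single numeric column; dictionary and frame-of-reference encodings remove redundancy in a similarly column-local way before zlib/LZO runs. The class and method names are illustrative only.

import java.util.ArrayList;
import java.util.List;

public class RunLengthEncodeDemo {
    // Encode a repetitive column such as [5, 5, 5, 9, 9] as (value, count) runs [(5,3), (9,2)].
    public static List<long[]> encode(long[] column) {
        List<long[]> runs = new ArrayList<long[]>();
        for (long v : column) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++;          // extend the current run
            } else {
                runs.add(new long[] { v, 1 });           // start a new (value, count) run
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        long[] column = { 5, 5, 5, 9, 9 };
        for (long[] run : encode(column)) {
            System.out.println(run[0] + " x " + run[1]);
        }
    }
}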
At the same time, Hortonworks was experimenting with similar ideas to improve Hive's storage format. Hortonworks's engineering team designed and implemented ORCFile (including the storage format and the read/write interfaces), which provided a good starting point for the design and implementation of the new storage format in Facebook's data warehouse.

For an introduction to ORCFile, see: http://yanbohappy.sinaapp.com/?p=478

As for a performance evaluation, the author is not in a position to run one for now; the original post included benchmark figures from a speaker at a Hive technical summit.

5. How to generate RCFile files

After all of the above, you presumably understand that RCFile is mainly used to improve the efficiency of Hive queries. So how do you generate files in this format?

(1) Convert directly in Hive by inserting from a TextFile table

For example:

insert overwrite table http_rctable partition (dt='2013-09-30') select ... from ... where dt='2013-09-30';
Here http_rctable is a table created with STORED AS RCFILE; fill in the column list and the source TextFile table according to your own schema.
(2) Generate with MapReduce

So far, MapReduce itself provides no built-in API for RCFile, but other projects in the Hadoop ecosystem, such as Pig, Hive, and HCatalog, do support it. The reason is that, compared with other file formats such as TextFile, RCFile offers no significant advantage for typical MapReduce application scenarios.

To avoid reinventing the wheel, the MapReduce code below that generates RCFile calls classes from Hive and HCatalog. Note that when you test this code, your Hadoop, Hive, and HCatalog versions must be consistent with each other; otherwise... you know what happens.

For example, I use hive-0.10.0+198-1.cdh4.4.0, so you should download the corresponding versions from: http://archive.cloudera.com/cdh4/cdh/4/

PS: the following code has been tested and works without problems.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.rcfile.RCFileMapReduceInputFormat;
import org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat;

public class TextToRCFile extends Configured implements Tool {

    public static class Map extends Mapper<Object, Text, NullWritable, BytesRefArrayWritable> {

        private byte[] fieldData;
        private int numCols;
        private BytesRefArrayWritable bytes;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            numCols = context.getConfiguration().getInt("hive.io.rcfile.column.number.conf", 0);
            bytes = new BytesRefArrayWritable(numCols);
        }

        public void map(Object key, Text line, Context context) throws IOException, InterruptedException {
            bytes.clear();
            String[] cols = line.toString().split("\\|");
            System.out.println("SIZE : " + cols.length);
            for (int i = 0; i < numCols; i++) {
                fieldData = cols[i].getBytes("UTF-8");
                BytesRefWritable cu = null;
                cu = new BytesRefWritable(fieldData, 0, fieldData.length);
                bytes.set(i, cu);
            }
            context.write(NullWritable.get(), bytes);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.out.println("Usage: " +
                    "hadoop jar RCFileLoader.jar <main class> " +
                    "-tableName <tableName> -numCols <numberOfColumns> -input <input path> " +
                    "-output <output path> -rowGroupSize <rowGroupSize> -ioBufferSize <ioBufferSize>");
            System.out.println("For test");
            System.out.println("$HADOOP jar RCFileLoader.jar edu.osu.cse.rsam.rcfile.mapreduce.LoadTable " +
                    "-tableName test1 -numCols 10 -input RCFileLoaderTest/test1 " +
                    "-output RCFileLoaderTest/RCFile_test1");
            System.out.println("$HADOOP jar RCFileLoader.jar edu.osu.cse.rsam.rcfile.mapreduce.LoadTable " +
                    "-tableName test2 -numCols 5 -input RCFileLoaderTest/test2 " +
                    "-output RCFileLoaderTest/RCFile_test2");
            return 2;
        }

        /* For test */
        String tableName = "";
        int numCols = 0;
        String inputPath = "";
        String outputPath = "";
        int rowGroupSize = 16 * 1024 * 1024;
        int ioBufferSize = 128 * 1024;
        for (int i = 0; i < otherArgs.length - 1; i++) {
            if ("-tableName".equals(otherArgs[i])) {
                tableName = otherArgs[i + 1];
            } else if ("-numCols".equals(otherArgs[i])) {
                numCols = Integer.parseInt(otherArgs[i + 1]);
            } else if ("-input".equals(otherArgs[i])) {
                inputPath = otherArgs[i + 1];
            } else if ("-output".equals(otherArgs[i])) {
                outputPath = otherArgs[i + 1];
            } else if ("-rowGroupSize".equals(otherArgs[i])) {
                rowGroupSize = Integer.parseInt(otherArgs[i + 1]);
            } else if ("-ioBufferSize".equals(otherArgs[i])) {
                ioBufferSize = Integer.parseInt(otherArgs[i + 1]);
            }
        }

        conf.setInt("hive.io.rcfile.record.buffer.size", rowGroupSize);
        conf.setInt("io.file.buffer.size", ioBufferSize);

        Job job = new Job(conf, "RCFile loader: loading table " + tableName + " with " + numCols + " columns");
        job.setJarByClass(TextToRCFile.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(BytesRefArrayWritable.class);
        // job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(inputPath));

        job.setOutputFormatClass(RCFileMapReduceOutputFormat.class);
        RCFileMapReduceOutputFormat.setColumnNumber(job.getConfiguration(), numCols);
        RCFileMapReduceOutputFormat.setOutputPath(job, new Path(outputPath));
        RCFileMapReduceOutputFormat.setCompressOutput(job, false);

        System.out.println("Loading table " + tableName + " from " + inputPath
                + " to RCFile located at " + outputPath);
        System.out.println("Number of columns: " + job.getConfiguration().get("hive.io.rcfile.column.number.conf"));
        System.out.println("RCFile row group size: " + job.getConfiguration().get("hive.io.rcfile.record.buffer.size"));
        System.out.println("IO buffer size: " + job.getConfiguration().get("io.file.buffer.size"));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TextToRCFile(), args);
        System.exit(res);
    }
}
6. References:

(1) Analysis of Hadoop file formats: http://www.infoq.com/cn/articles/hadoop-file-format

(2) Facebook's data warehouse revealed: the efficient RCFile storage structure: http://www.csdn.net/article/2011-04-29/296900

(3) How the Facebook data warehouse scaled to 300 PB: http://yanbohappy.sinaapp.com/?p=478

(4) Hive architecture: http://www.jdon.com/bigdata/hive.html

(5) Hive: the ORC File storage format in detail: http://www.iteblog.com/archives/1014

(6) A generic class for converting plain text into RCFile: https://github.com/ysmart-xx/ysmart/blob/master/javatest/TextToRCFile.java

(7) Writing MapReduce code against Hive file formats: http://hugh-wangp.iteye.com/blog/1405804

(8) A generic class for converting plain text into RCFile: http://smallboby.iteye.com/blog/1596776

(9) RCFile storage and read operations: http://smallboby.iteye.com/blog/1592531

(10) https://github.com/kevinweil/elephant-bird/blob/master/rcfile/src/main/java/com/twitter/elephantbird/mapreduce/output/RCFileOutputFormat.java

(11) http://blog.csdn.net/liuzhoulong/article/details/7909863
